Preface 



This volume contains the Proceedings of ICFCA 2004, the 2nd International 
Conference on Formal Concept Analysis. The ICFCA conference series aims to 
be the premier forum for the publication of advances in applied lattice and order 
theory and in particular scientific advances related to formal concept analysis. 

Formal concept analysis emerged in the 1980s from efforts to restructure 
lattice theory to promote better communication between lattice theorists and 
potential users of lattice theory. Since then, the field has developed into a growing 
research area in its own right with a thriving theoretical community and an 
increasing number of applications in data and knowledge processing including 
data visualization, information retrieval, machine learning, data analysis and 
knowledge management. 

In terms of theory, formal concept analysis has been extended into attribute 
exploration, Boolean judgment, contextual logic and so on to create a powerful 
general framework for knowledge representation and reasoning. This conference 
aims to unify theoretical and applied practitioners who use formal concept anal- 
ysis, drawing on the fields of mathematics, computer and library sciences and 
software engineering. The theme of the 2004 conference was “Concept Lattices” 
to acknowledge the colloquial term used for the line diagrams that appear in 
almost every paper in this volume. 

ICFCA 2004 included tutorial sessions, demonstrating the practical benefits 
of formal concept analysis, and highlighted developments in the foundational 
theory and standards. The conference showcased the increasing variety of formal 
concept analysis software and included eight invited lectures from distinguished 
speakers in the field. Seven of the eight invited speakers submitted accompanying 
papers and these were reviewed and appear in this volume. 

All regular papers appearing in this volume were refereed by at least two 
referees. In almost all cases three (or more) referee reports were returned. Long 
papers of approximately 14 pages represent substantial results deserving addi- 
tional space based on the recommendations of reviewers. The final decision to 
accept the papers (as long, short or at all) was arbitrated by the Program Chair 
based on the referee reports. As Program Chair, I wish to thank the Program 
Committee and the additional reviewers for their involvement which ensured 
the high scientific quality of these proceedings. I also wish to particularly thank 
Prof. Paul Compton and Rudolf Wille, Dr. Richard Cole and Peter Becker for 
their support and enthusiasm. 



January 2004 



Peter Eklund 
Preconcept Algebras and Generalized Double Boolean Algebras 



Rudolf Wille 

Technische Universität Darmstadt, Fachbereich Mathematik 
Schloßgartenstr. 7, D-64289 Darmstadt 
wille@mathematik.tu-darmstadt.de 



Abstract. Boolean Concept Logic as an integrated generalization of 
Contextual Object Logic and Contextual Attribute Logic can be substan- 
tially developed on the basis of preconcept algebras. The main results 
reported in this paper are the Basic Theorem on Preconcept Algebras 
and the Theorem characterizing the equational class generated by all 
preconcept algebras by the equational axioms of the generalized double 
Boolean algebras. 



1 Preconcepts in Boolean Concept Logic 

Concepts are the basic units of thought wherefore a concept-oriented mathemati- 
cal logic is of great interest. G. Boole has offered the most influential foundation 
for such a logic which is based on a general conception of signs representing 
classes of objects from a given universe of discourse [Bo54]. In the language of 
Formal Concept Analysis [GW99a], Boole’s basic notions can be explicated: 

- for a “universe of discourse”, by the notion of a “formal context” defined as 
a mathematical structure $\mathbb{K} := (G, M, I)$ where $G$ is a set whose elements 
are called “objects”, $M$ is a set whose elements are called “attributes”, and 
$I$ is a subset of $G \times M$ for which $gIm$ (i.e. $(g, m) \in I$) is read: the object $g$ 
has the attribute $m$, 

- for a “sign”, by the notion of an “attribute” of a formal context, and 

- for a “class”, by the notion of an “extent” defined in a formal context 
$\mathbb{K} := (G, M, I)$ as a subset $Y' := \{g \in G \mid \forall m \in Y : gIm\}$ for some $Y \subseteq M$. 

How Boole’s logic of signs and classes may be developed as a Contextual At- 
tribute Logic by means of Formal Concept Analysis is outlined in [GW99b]. The 
dual Contextual Object Logic, which is for instance used to determine conceptual 
contents of information [Wi03a], can be obtained from Contextual Attribute 
Logic by interchanging the role of objects and attributes so that, in particular, 
the notion of an “extent” is replaced by 

- the notion of an “intent” defined in a formal context $\mathbb{K} := (G, M, I)$ as a 
subset $X' := \{m \in M \mid \forall g \in X : gIm\}$ for some $X \subseteq G$. 
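
The two derivation operators can be sketched in a few lines of Python. The example below is only an illustration: it uses the small family context that the paper presents in Fig. 1, and the names `extent` and `intent` are chosen here for readability and are not taken from the paper.

```python
# Family context of Fig. 1: objects G, attributes M, incidence relation I.
G = ["father", "mother", "son", "daughter"]
M = ["male", "female", "old", "young"]
I = {
    ("father", "male"), ("father", "old"),
    ("mother", "female"), ("mother", "old"),
    ("son", "male"), ("son", "young"),
    ("daughter", "female"), ("daughter", "young"),
}

def intent(X):
    """X' : all attributes shared by every object in X."""
    return frozenset(m for m in M if all((g, m) in I for g in X))

def extent(Y):
    """Y' : all objects that have every attribute in Y."""
    return frozenset(g for g in G if all((g, m) in I for m in Y))

print(sorted(extent({"male"})))             # objects that are male
print(sorted(intent({"father", "mother"})))  # attributes common to both parents
```

Applying a derivation three times gives the same result as applying it once, which is the identity $X''' = X'$ used later in the text; with these helpers it reads `intent(extent(intent(X))) == intent(X)`.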

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 1-13, 2004. 
Since a concept, as a unit of thought, combines an extension consisting of 
objects and an intension consisting of attributes (properties, meanings) (cf. [Sch90], 
p. 83ff.), a concept-oriented mathematical logic should be an integrated 
generalization of a mathematical attribute logic and a mathematical object logic. In 
our contextual approach based on Formal Concept Analysis, such an integrated 
generalization can be founded on 

- the notion of a “formal concept” defined, in a formal context $\mathbb{K} := (G, M, I)$, 
as a pair $(A, B)$ with $A \subseteq G$, $B \subseteq M$, $A = B'$, and $B = A'$ [Wi82], 

and its generalizations: 

- the notions of a “$\sqcap$-semiconcept” $(A, A')$ with $A \subseteq G$ and a “$\sqcup$-semiconcept” 
$(B', B)$ with $B \subseteq M$ [LW91], the notion of a “protoconcept” $(A, B)$ with 
$A \subseteq G$, $B \subseteq M$, and $A'' = B'$ ($\Longleftrightarrow B'' = A'$) [Wi00a], and the notion of a 
“preconcept” $(A, B)$ with $A \subseteq G$, $B \subseteq M$, and $A \subseteq B'$ ($\Longleftrightarrow B \subseteq A'$) [SW86]. 

Clearly, formal concepts are always semiconcepts, semiconcepts are always 
protoconcepts, and protoconcepts are always preconcepts. Since, for $X \subseteq G$ and 
$Y \subseteq M$, we always have $X''' = X'$ and $Y''' = Y'$, formal concepts can in general 
be constructed by forming $(X'', X')$ or $(Y', Y'')$. The basic logical derivations 
$X \mapsto X'$ and $Y \mapsto Y'$ may be naturally generalized to the conceptual level by 

- $(X, Y) \mapsto (X, X') \sqsubseteq_2 (X'', X')$ and $(X, Y) \mapsto (Y', Y) \sqsubseteq_2 (Y', Y'')$ 
for an arbitrary preconcept $(X, Y)$ of $\mathbb{K} := (G, M, I)$. 

It is relevant to assume that $(X, Y)$ is a preconcept because otherwise we would 
obtain $Y \not\subseteq X'$ and $X \not\subseteq Y'$, i.e., $(X, X')$ and $(Y', Y)$ would not be extensions 
of $(X, Y)$ with respect to the order $\sqsubseteq_2$ defined by 

- $(X_1, Y_1) \sqsubseteq_2 (X_2, Y_2) :\Longleftrightarrow X_1 \subseteq X_2$ and $Y_1 \subseteq Y_2$. 

Notice that, in the ordered set $(\mathfrak{V}(\mathbb{K}), \sqsubseteq_2)$ of all preconcepts of a formal context 
$\mathbb{K} := (G, M, I)$, the formal concepts of $\mathbb{K}$ are exactly the maximal elements 
and the protoconcepts of $\mathbb{K}$ are just the elements which are below exactly one 
maximal element (formal concept). 
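
The hierarchy of notions and the remark on maximal elements can be checked by brute force on a small context. The following sketch assumes the family context of Fig. 1; the function `classify` and the helper names are hypothetical and serve only this illustration.

```python
from itertools import combinations

G = ["father", "mother", "son", "daughter"]
M = ["male", "female", "old", "young"]
I = {
    ("father", "male"), ("father", "old"),
    ("mother", "female"), ("mother", "old"),
    ("son", "male"), ("son", "young"),
    ("daughter", "female"), ("daughter", "young"),
}

def subsets(S):
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def intent(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def extent(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

def classify(A, B):
    """Return the most specific notion that the pair (A, B) satisfies."""
    A, B = frozenset(A), frozenset(B)
    if not A <= extent(B):
        return "not a preconcept"
    if A == extent(B) and B == intent(A):
        return "formal concept"
    if B == intent(A) or A == extent(B):
        return "semiconcept"
    if extent(intent(A)) == extent(B):   # A'' = B'
        return "protoconcept"
    return "preconcept"

# All preconcepts: pairs (A, B) with A contained in B'.
preconcepts = [(A, B) for B in subsets(M) for A in subsets(extent(B))]

def below(x, y):
    """Componentwise inclusion, i.e. the order written as sqsubseteq_2."""
    return x[0] <= y[0] and x[1] <= y[1]

maximal = {x for x in preconcepts
           if not any(x != y and below(x, y) for y in preconcepts)}
concepts = {x for x in preconcepts if classify(*x) == "formal concept"}
print(len(preconcepts), len(concepts), maximal == concepts)
```

For this context the enumeration finds 47 preconcepts, 10 of which are formal concepts, and the maximal preconcepts coincide with the formal concepts.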

For contextually developing a Boolean Concept Logic as an integrated 
generalization of the Contextual Object Logic and the Contextual Attribute Logic, 
Boolean operations have to be introduced on the set $\mathfrak{V}(\mathbb{K})$ of all preconcepts 
of a formal context $\mathbb{K} := (G, M, I)$. That shall be done in the same way as for 
semiconcepts [LW91] and for protoconcepts [Wi00a]: 

$(A_1, B_1) \sqcap (A_2, B_2) := (A_1 \cap A_2, (A_1 \cap A_2)')$ 

$(A_1, B_1) \sqcup (A_2, B_2) := ((B_1 \cap B_2)', B_1 \cap B_2)$ 

$\neg(A, B) := (G \setminus A, (G \setminus A)')$ 

$\lrcorner(A, B) := ((M \setminus B)', M \setminus B)$ 

$\bot := (\emptyset, M)$ 

$\top := (G, \emptyset)$ 
The set $\mathfrak{V}(\mathbb{K})$ together with the operations $\sqcap$, $\sqcup$, $\neg$, $\lrcorner$, $\bot$, and $\top$ is called the 
preconcept algebra of $\mathbb{K}$ and is denoted by $\underline{\mathfrak{V}}(\mathbb{K})$; the operations are named “meet”, 
“join”, “negation”, “opposition”, “nothing”, and “all”. For the structural 
analysis of the preconcept algebra $\underline{\mathfrak{V}}(\mathbb{K})$, it is useful to define additional operations 
on $\mathfrak{V}(\mathbb{K})$: 

$a \mathbin{\underline{\sqcup}} b := \neg(\neg a \sqcap \neg b)$ and $a \mathbin{\overline{\sqcap}} b := \lrcorner(\lrcorner a \sqcup \lrcorner b)$, 

$\underline{\top} := \neg\bot$ and $\overline{\bot} := \lrcorner\top$. 
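
As a minimal sketch, the six basic operations and the two derived binary operations can be written directly from their definitions. The code assumes the family context of Fig. 1 and represents a preconcept as a pair of frozensets; the function names (`meet`, `join`, `neg`, `opp`, `ujoin`, `omeet`) are labels invented for this example.

```python
G = frozenset(["father", "mother", "son", "daughter"])
M = frozenset(["male", "female", "old", "young"])
I = {
    ("father", "male"), ("father", "old"),
    ("mother", "female"), ("mother", "old"),
    ("son", "male"), ("son", "young"),
    ("daughter", "female"), ("daughter", "young"),
}

def intent(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def extent(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

def meet(x, y):          # (A1 ∩ A2, (A1 ∩ A2)')
    A = x[0] & y[0]
    return (A, intent(A))

def join(x, y):          # ((B1 ∩ B2)', B1 ∩ B2)
    B = x[1] & y[1]
    return (extent(B), B)

def neg(x):              # negation: (G \ A, (G \ A)')
    A = G - x[0]
    return (A, intent(A))

def opp(x):              # opposition: ((M \ B)', M \ B)
    B = M - x[1]
    return (extent(B), B)

BOTTOM = (frozenset(), M)    # "nothing"
TOP = (G, frozenset())       # "all"

def ujoin(a, b):         # derived join: neg(neg a meet neg b)
    return neg(meet(neg(a), neg(b)))

def omeet(a, b):         # derived meet: opp(opp a join opp b)
    return opp(join(opp(a), opp(b)))

x = (frozenset({"father"}), frozenset({"male"}))
y = (frozenset({"son"}), frozenset({"male", "young"}))
print(meet(x, y) == BOTTOM)
print(join(x, y))
print(ujoin(x, y))
```

In this example the meet of `x` and `y` is the bottom element because the two extents are disjoint, while both `join` and `ujoin` yield the pair with extent {father, son} and intent {male}.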

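The semiconcepts resp. protoconcepts of $\mathbb{K}$ form subalgebras $\underline{\mathfrak{H}}(\mathbb{K})$ resp. $\underline{\mathfrak{P}}(\mathbb{K})$ 
of $\underline{\mathfrak{V}}(\mathbb{K})$ which are called the semiconcept algebra resp. protoconcept algebra of 
$\mathbb{K}$. The set $\mathfrak{H}_\sqcap(\mathbb{K})$ of all $\sqcap$-semiconcepts is closed under the operations $\sqcap$, $\underline{\sqcup}$, 
$\neg$, $\bot$, and $\underline{\top}$; therefore, $\underline{\mathfrak{H}}_\sqcap(\mathbb{K}) := (\mathfrak{H}_\sqcap(\mathbb{K}), \sqcap, \underline{\sqcup}, \neg, \bot, \underline{\top})$ is a Boolean algebra 
isomorphic to the Boolean algebra of all subsets of $G$. Dually, the set $\mathfrak{H}_\sqcup(\mathbb{K})$ 
of all $\sqcup$-semiconcepts is closed under the operations $\overline{\sqcap}$, $\sqcup$, $\lrcorner$, $\overline{\bot}$, and $\top$; 
therefore, $\underline{\mathfrak{H}}_\sqcup(\mathbb{K}) := (\mathfrak{H}_\sqcup(\mathbb{K}), \overline{\sqcap}, \sqcup, \lrcorner, \overline{\bot}, \top)$ is a Boolean algebra antiisomorphic to the 
Boolean algebra of all subsets of $M$. Furthermore, $\mathfrak{B}(\mathbb{K}) = \mathfrak{H}_\sqcap(\mathbb{K}) \cap \mathfrak{H}_\sqcup(\mathbb{K})$, and 
$(\mathfrak{B}(\mathbb{K}), \wedge, \vee)$ is the so-called concept lattice of $\mathbb{K}$ with the operations $\wedge$ and $\vee$ 
induced by the operations $\sqcap$ and $\sqcup$, respectively. The general order relation $\sqsubseteq$ 
of $\mathfrak{P}(\mathbb{K})$, which coincides on $\mathfrak{B}(\mathbb{K})$ with the subconcept-superconcept-order $\leq$, 
is defined by 

$(A_1, B_1) \sqsubseteq (A_2, B_2) :\Longleftrightarrow A_1 \subseteq A_2$ and $B_1 \supseteq B_2$. 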

The introduced notions found a Boolean Concept Logic in which the 
Contextual Object Logic and the Contextual Attribute Logic can be integrated by 
transforming any object set $X$ to the $\sqcap$-semiconcept $(X, X')$ and any attribute 
set $Y$ to the corresponding $\sqcup$-semiconcept $(Y', Y)$. In the case of Contextual 
Attribute Logic [GW99b], this integration comprises a transformation of the 
Boolean compositions of attributes which is generated by the following 
elementary assignments: 

$m \wedge n \mapsto (\{m\}', \{m\}) \sqcap (\{n\}', \{n\})$, 
$m \vee n \mapsto (\{m\}', \{m\}) \mathbin{\underline{\sqcup}} (\{n\}', \{n\})$, 
$\neg m \mapsto \neg(\{m\}', \{m\})$. 

In the dual case of Contextual Object Logic, the corresponding transformation 
uses the operations $\overline{\sqcap}$, $\sqcup$, and $\lrcorner$. 

Preconcept algebras can be illustrated by line diagrams, which shall be 
demonstrated by using the small formal context in Fig. 1. The line diagram of the 
preconcept algebra of that formal context is shown in Fig. 2: the formal concepts 
are represented by the black circles, the proper $\sqcap$-semiconcepts by the circles 
with only a black lower half, the proper $\sqcup$-semiconcepts by the circles with only 
a black upper half, and the proper preconcepts (which are not even 
protoconcepts) by the unblackened circles. An object (attribute) belongs to a preconcept 
if and only if its name is attached to a circle representing a subpreconcept 
(superpreconcept) of that preconcept. The regularity of the line diagram in Fig. 2 
has a general reason which becomes clear by the following proposition: 
         | male | female | old | young 
father   |  ×   |        |  ×  | 
mother   |      |   ×    |  ×  | 
son      |  ×   |        |     |   × 
daughter |      |   ×    |     |   × 

Fig. 1. A context $\mathbb{K}_f$ of family members 




Fig. 2. Line diagram of the preconcept algebra of the formal context in Fig. 1 
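Proposition 1 For a formal context $\mathbb{K} := (G, M, I)$, the ordered set $(\mathfrak{V}(\mathbb{K}), \sqsubseteq)$ 
is a completely distributive complete lattice, which is isomorphic to the concept 
lattice of the formal context $V(\mathbb{K}) := (G \mathbin{\dot\cup} M, G \mathbin{\dot\cup} M, I \cup (\neq \setminus\, G \times M))$. 

Proof For $(A_t, B_t) \in \mathfrak{V}(\mathbb{K})$ ($t \in T$), we obviously have 

$\inf_{t \in T}(A_t, B_t) = (\bigcap_{t \in T} A_t, \bigcup_{t \in T} B_t)$ and $\sup_{t \in T}(A_t, B_t) = (\bigcup_{t \in T} A_t, \bigcap_{t \in T} B_t)$; 

hence $\mathfrak{V}(\mathbb{K})$ is a complete sublattice of the completely distributive complete 
lattice $(\mathcal{P}(G), \subseteq) \times (\mathcal{P}(M), \supseteq)$ which proves the first assertion. For proving the 
second assertion, we consider the assignment $\iota : (A, B) \mapsto (A \cup (M \setminus B), (G \setminus A) \cup B)$. 
It can be easily checked that $(A \cup (M \setminus B), (G \setminus A) \cup B)$ is a formal concept of $V(\mathbb{K})$. 
Let $(C, D)$ be an arbitrary formal concept of $V(\mathbb{K})$. Obviously, $C = (C \cap G) \cup (M \setminus D)$, 
$D = (G \setminus C) \cup (D \cap M)$, and $(C \cap G, D \cap M)$ is a preconcept of $\mathbb{K}$. Therefore we 
obtain $\iota(C \cap G, D \cap M) = (C, D)$. Thus, $\iota$ is a bijection from $\mathfrak{V}(\mathbb{K})$ onto $\mathfrak{B}(V(\mathbb{K}))$. 
Since $(A_1, B_1) \sqsubseteq (A_2, B_2) \Longleftrightarrow A_1 \subseteq A_2$ and $B_1 \supseteq B_2 \Longleftrightarrow A_1 \cup (M \setminus B_1) \subseteq 
A_2 \cup (M \setminus B_2) \Longleftrightarrow (A_1 \cup (M \setminus B_1), (G \setminus A_1) \cup B_1) \leq (A_2 \cup (M \setminus B_2), (G \setminus A_2) \cup B_2)$, 

the bijection $\iota$ is even an isomorphism from $\mathfrak{V}(\mathbb{K})$ onto $\mathfrak{B}(V(\mathbb{K}))$. □ 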
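
Proposition 1 can be illustrated computationally on the family context of Fig. 1. The sketch below builds the context $V(\mathbb{K})$ with the incidence relation as read from the proposition (the relation $I$ on $G \times M$ and "different from" on all other pairs), enumerates its formal concepts, and compares them with the images of the preconcepts under the assignment from the proof. The helper names are assumptions of this example.

```python
from itertools import combinations

G = ["father", "mother", "son", "daughter"]
M = ["male", "female", "old", "young"]
I = {
    ("father", "male"), ("father", "old"),
    ("mother", "female"), ("mother", "old"),
    ("son", "male"), ("son", "young"),
    ("daughter", "female"), ("daughter", "young"),
}
U = G + M   # disjoint union of objects and attributes

def subsets(S):
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def extent(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

def related(x, y):
    """Incidence of V(K): I on G x M, inequality on every other pair."""
    if x in G and y in M:
        return (x, y) in I
    return x != y

def v_intent(C):
    return frozenset(y for y in U if all(related(x, y) for x in C))

def v_extent(D):
    return frozenset(x for x in U if all(related(x, y) for y in D))

# Preconcepts of K and formal concepts of V(K).
preconcepts = [(A, B) for B in subsets(M) for A in subsets(extent(B))]
v_concepts = {(v_extent(v_intent(C)), v_intent(C)) for C in subsets(U)}

def iota(A, B):
    """The assignment (A, B) -> (A ∪ (M \\ B), (G \\ A) ∪ B) from the proof."""
    return (A | (frozenset(M) - B), (frozenset(G) - A) | B)

image = {iota(A, B) for A, B in preconcepts}
print(len(preconcepts), len(v_concepts), image == v_concepts)
```

For this context both sets have 47 elements and the image of the assignment is exactly the set of formal concepts of $V(\mathbb{K})$, in line with the bijection claimed in the proof.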

2 The Basic Theorem on Preconcept Algebras 

A detailed understanding of the structure of preconcept algebras is basic for 
Boolean Concept Logic. To start the necessary structure analysis, we first deter- 
mine basic equations valid in all preconcept algebras and study abstractly the 
class of all algebras satisfying those equations. The aim is to prove a charac- 
terization of preconcept algebras analogously to the Basic Theorem on Concept 
Lattices [Wi82]. 

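Proposition 2 In a preconcept algebra $\underline{\mathfrak{V}}(\mathbb{K})$ the following equations are valid: 

1a) $(x \sqcap x) \sqcap y = x \sqcap y$          1b) $(x \sqcup x) \sqcup y = x \sqcup y$ 
2a) $x \sqcap y = y \sqcap x$          2b) $x \sqcup y = y \sqcup x$ 
3a) $x \sqcap (y \sqcap z) = (x \sqcap y) \sqcap z$          3b) $x \sqcup (y \sqcup z) = (x \sqcup y) \sqcup z$ 
4a) $x \sqcap (x \sqcup y) = x \sqcap x$          4b) $x \sqcup (x \sqcap y) = x \sqcup x$ 
5a) $x \sqcap (x \mathbin{\underline{\sqcup}} y) = x \sqcap x$          5b) $x \sqcup (x \mathbin{\overline{\sqcap}} y) = x \sqcup x$ 
6a) $x \sqcap (y \mathbin{\underline{\sqcup}} z) = (x \sqcap y) \mathbin{\underline{\sqcup}} (x \sqcap z)$          6b) $x \sqcup (y \mathbin{\overline{\sqcap}} z) = (x \sqcup y) \mathbin{\overline{\sqcap}} (x \sqcup z)$ 
7a) $\neg\neg(x \sqcap y) = x \sqcap y$          7b) $\lrcorner\lrcorner(x \sqcup y) = x \sqcup y$ 
8a) $\neg(x \sqcap x) = \neg x$          8b) $\lrcorner(x \sqcup x) = \lrcorner x$ 
9a) $x \sqcap \neg x = \bot$          9b) $x \sqcup \lrcorner x = \top$ 
10a) $\neg\bot = \top \sqcap \top$          10b) $\lrcorner\top = \bot \sqcup \bot$ 
11a) $\neg\top = \bot$          11b) $\lrcorner\bot = \top$ 
12a) $x_{\sqcap\sqcup\sqcap} = x_{\sqcap\sqcup}$          12b) $x_{\sqcup\sqcap\sqcup} = x_{\sqcup\sqcap}$ 

where $t_\sqcap := t \sqcap t$ and $t_\sqcup := t \sqcup t$ is defined for every term $t$. 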



Proof The equations of the proposition can be easily verified in preconcept 
algebras. This shall only be demonstrated by proving 4a) and 12a): 
$(A, B) \sqcap ((A, B) \sqcup (C, D)) = (A, B) \sqcap ((B \cap D)', B \cap D) = (A \cap (B \cap D)', (A \cap (B \cap D)')') = 
(A, A') = (A, B) \sqcap (A, B)$ and $(A, B)_{\sqcap\sqcup\sqcap} = (A, A')_{\sqcup\sqcap} = (A'', A')_\sqcap = (A'', A') = 
(A, A')_\sqcup = (A, B)_{\sqcap\sqcup}$. □ 
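
The equations can also be tested exhaustively on a small preconcept algebra. The following sketch assumes the family context of Fig. 1 and checks 4a), 12a) and 12b) over all preconcepts; it additionally evaluates the stronger identity $x_{\sqcap\sqcup} = x_{\sqcup\sqcap}$ to show that it is not implied. All function names are illustrative.

```python
from itertools import combinations

G = ["father", "mother", "son", "daughter"]
M = ["male", "female", "old", "young"]
I = {
    ("father", "male"), ("father", "old"),
    ("mother", "female"), ("mother", "old"),
    ("son", "male"), ("son", "young"),
    ("daughter", "female"), ("daughter", "young"),
}

def subsets(S):
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def intent(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def extent(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

def meet(x, y):
    A = x[0] & y[0]
    return (A, intent(A))

def join(x, y):
    B = x[1] & y[1]
    return (extent(B), B)

def x_meet(x):   # x meet x
    return meet(x, x)

def x_join(x):   # x join x
    return join(x, x)

# All preconcepts (A, B) with A contained in B'.
V = [(A, B) for B in subsets(M) for A in subsets(extent(B))]

eq_4a = all(meet(x, join(x, y)) == x_meet(x) for x in V for y in V)
eq_12a = all(x_meet(x_join(x_meet(x))) == x_join(x_meet(x)) for x in V)
eq_12b = all(x_join(x_meet(x_join(x))) == x_meet(x_join(x)) for x in V)
eq_12 = all(x_join(x_meet(x)) == x_meet(x_join(x)) for x in V)
print(eq_4a, eq_12a, eq_12b, eq_12)
```

On this context the first three checks succeed and the last one fails; the preconcept $(\emptyset, \emptyset)$ is a counterexample, since $x_{\sqcap\sqcup} = (\emptyset, M)$ while $x_{\sqcup\sqcap} = (G, \emptyset)$.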
An algebra $\underline{D} := (D, \sqcap, \sqcup, \neg, \lrcorner, \bot, \top)$ of type (2, 2, 1, 1, 0, 0) is called a weak 
double Boolean algebra if it satisfies the equations 1a) to 11a) and 1b) to 11b) 
and a generalized double Boolean algebra if it satisfies the equations 1a) to 12a) 
and 1b) to 12b) of Proposition 2; if, in addition, it even satisfies the equation 12) 
$x_{\sqcap\sqcup} = x_{\sqcup\sqcap}$, $\underline{D}$ is called a double Boolean algebra [HLSW00]. For weak double 
Boolean algebras, further operations are defined as in the case of preconcept 
algebras: 

$x \mathbin{\underline{\sqcup}} y := \neg(\neg x \sqcap \neg y)$ and $x \mathbin{\overline{\sqcap}} y := \lrcorner(\lrcorner x \sqcup \lrcorner y)$, 

$\underline{\top} := \neg\bot$ and $\overline{\bot} := \lrcorner\top$, 
$x_\sqcap := x \sqcap x$ and $x_\sqcup := x \sqcup x$. 

By Proposition 2, each preconcept algebra is a generalized double Boolean 
algebra. Protoconcept algebras are double Boolean algebras. Semiconcept 
algebras satisfy the additional condition 13) $x \sqcap x = x$ or $x \sqcup x = x$. A weak 
double Boolean algebra $\underline{D}$ satisfying the condition 13) is called pure, because 
it is only the union of the two subsets $D_\sqcap := \{x \in D \mid x \sqcap x = x\}$ and 
$D_\sqcup := \{x \in D \mid x \sqcup x = x\}$ which both carry a Boolean structure, i.e., 
$\underline{D}_\sqcap := (D_\sqcap, \sqcap, \underline{\sqcup}, \neg, \bot, \underline{\top})$ and $\underline{D}_\sqcup := (D_\sqcup, \overline{\sqcap}, \sqcup, \lrcorner, \overline{\bot}, \top)$ are Boolean algebras. 
As on preconcept algebras, a quasiorder is defined on weak double Boolean 
algebras by 

$x \sqsubseteq y :\Longleftrightarrow x \sqcap y = x \sqcap x$ and $x \sqcup y = y \sqcup y$. 
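
For preconcept algebras this algebraically defined quasiorder agrees with the order on pairs (extent inclusion together with reverse intent inclusion). A small brute-force check, again assuming the family context of Fig. 1 and using hypothetical helper names, may make this concrete.

```python
from itertools import combinations

G = ["father", "mother", "son", "daughter"]
M = ["male", "female", "old", "young"]
I = {
    ("father", "male"), ("father", "old"),
    ("mother", "female"), ("mother", "old"),
    ("son", "male"), ("son", "young"),
    ("daughter", "female"), ("daughter", "young"),
}

def subsets(S):
    S = list(S)
    return [frozenset(c) for r in range(len(S) + 1) for c in combinations(S, r)]

def intent(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def extent(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

def meet(x, y):
    A = x[0] & y[0]
    return (A, intent(A))

def join(x, y):
    B = x[1] & y[1]
    return (extent(B), B)

def quasi_leq(x, y):
    """Algebraic definition: x meet y = x meet x and x join y = y join y."""
    return meet(x, y) == meet(x, x) and join(x, y) == join(y, y)

def pair_leq(x, y):
    """Order on pairs: A1 subset of A2 and B1 superset of B2."""
    return x[0] <= y[0] and x[1] >= y[1]

V = [(A, B) for B in subsets(M) for A in subsets(extent(B))]

same_order = all(quasi_leq(x, y) == pair_leq(x, y) for x in V for y in V)
antisymmetric = all(x == y for x in V for y in V
                    if quasi_leq(x, y) and quasi_leq(y, x))
print(same_order, antisymmetric)
```

Both checks succeed on this context. The second one reflects that the quasiorder of a preconcept algebra is in fact an order, which is the property called "contextual" further below.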

A weak double Boolean algebra $\underline{D}$ is said to be complete if the Boolean 
algebras $\underline{D}_\sqcap$ and $\underline{D}_\sqcup$ are complete. The existing infimum resp. supremum of a 
subset $A$ of $D_\sqcap$ are denoted by $\sqcap A$ resp. $\underline{\sqcup} A$ and, dually, of a subset $B$ of $D_\sqcup$ 
by $\overline{\sqcap} B$ resp. $\sqcup B$. In general, it is appropriate to define the arbitrary meet and 
join in $\underline{D}$ by $\sqcap C := \sqcap\{c_\sqcap \mid c \in C\}$ and $\sqcup C := \sqcup\{c_\sqcup \mid c \in C\}$ for arbitrary 
subsets $C$ of $D$. Clearly, preconcept algebras are examples of complete generalized 
double Boolean algebras. 

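In [HLSW00], weak double Boolean algebras $\underline{D} := (D, \sqcap, \sqcup, \neg, \lrcorner, \bot, \top)$ with 
the quasiorder $\sqsubseteq$ defined by $x \sqsubseteq y :\Longleftrightarrow (x \sqcap y = x \sqcap x$ and $x \sqcup y = y \sqcup y)$ are 
characterized as quasiordered sets $(D, \sqsubseteq)$ satisfying the following conditions: 

1. $(D, \sqsubseteq)$ has a smallest element 0 and a largest element 1; 

2. there is a subset $D_\sqcap$ of $D$ so that $(D_\sqcap, \sqsubseteq)$ is a Boolean algebra whose 
operations coincide with the operations $\sqcap$, $\underline{\sqcup}$, $\neg$, $\bot$, and $\underline{\top}$ of $\underline{D}$; 

3. there is a subset $D_\sqcup$ of $D$ so that $(D_\sqcup, \sqsubseteq)$ is a Boolean algebra whose 
operations coincide with the operations $\overline{\sqcap}$, $\sqcup$, $\lrcorner$, $\overline{\bot}$, and $\top$ of $\underline{D}$; 

4. for $x \in D$, there exists $x_\sqcap \in D_\sqcap$ with $y \sqsubseteq x_\sqcap \Longleftrightarrow y \sqsubseteq x$ for all $y \in D_\sqcap$; 

5. for $x \in D$, there exists $x_\sqcup \in D_\sqcup$ with $y \sqsupseteq x_\sqcup \Longleftrightarrow y \sqsupseteq x$ for all $y \in D_\sqcup$; 

6. $x \sqsubseteq y \Longleftrightarrow x_\sqcap \sqsubseteq y_\sqcap$ and $x_\sqcup \sqsubseteq y_\sqcup$. 

The links between the algebraically and order-theoretically given operations are 
established by the following equations: $x \sqcap y = x_\sqcap \sqcap y_\sqcap$, $x \sqcup y = x_\sqcup \sqcup y_\sqcup$, 
$\neg x = \neg(x_\sqcap)$, $\lrcorner x = \lrcorner(x_\sqcup)$, $\bot = 0$, and $\top = 1$. Therefore, for an order-theoretic 
characterization of generalized double Boolean algebras one has to add the 
further condition 

7. $x_{\sqcap\sqcup\sqcap} = x_{\sqcap\sqcup}$ and $x_{\sqcup\sqcap\sqcup} = x_{\sqcup\sqcap}$. 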
For formulating the Basic Theorem on Preconcept Algebras, we adapt notions 
which have been used for proving the corresponding Basic Theorem on 
Protoconcept Algebras [VW03]: A weak double Boolean algebra $\underline{D}$ is called contextual 
if its quasiorder $\sqsubseteq$ is antisymmetric. A contextual weak double Boolean algebra 
$\underline{D}$ is said to be totally contextual if, in addition, for each $x \in D_\sqcap$ and $y \in D_\sqcup$ 
with $x_\sqcup \sqsubseteq y_\sqcap$ there is a unique $z \in D$ with $z_\sqcap = x$ and $z_\sqcup = y$. 

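Theorem 1 (The Basic Theorem on Preconcept Algebras) For a context 
$\mathbb{K} := (G, M, I)$, the preconcept algebra $\underline{\mathfrak{V}}(\mathbb{K})$ of $\mathbb{K}$ is a complete totally contextual 
generalized double Boolean algebra whose Boolean algebras $\underline{\mathfrak{H}}_\sqcap(\mathbb{K})$ and $\underline{\mathfrak{H}}_\sqcup(\mathbb{K})$ are 
atomic. The (arbitrary) meet and join of $\underline{\mathfrak{V}}(\mathbb{K})$ are given by 

$\sqcap_{t \in T}(A_t, B_t) = (\bigcap_{t \in T} A_t, (\bigcap_{t \in T} A_t)')$ and $\sqcup_{t \in T}(A_t, B_t) = ((\bigcap_{t \in T} B_t)', \bigcap_{t \in T} B_t)$. 

In general, a complete totally contextual generalized double Boolean algebra $\underline{D}$ 
whose Boolean algebras $\underline{D}_\sqcap$ and $\underline{D}_\sqcup$ are atomic, is isomorphic to $\underline{\mathfrak{V}}(\mathbb{K})$ if and 
only if there exist a bijection $\gamma$ from $G$ onto the set $A(\underline{D}_\sqcap)$ of all atoms of $\underline{D}_\sqcap$ 
and a bijection $\mu$ from $M$ onto the set $C(\underline{D}_\sqcup)$ of all coatoms of $\underline{D}_\sqcup$ such that 
$gIm \Longleftrightarrow \gamma(g) \sqsubseteq \mu(m)$ for all $g \in G$ and $m \in M$. In particular, for any 
complete totally contextual generalized double Boolean algebra $\underline{D}$ whose Boolean 
algebras are atomic, we get $\underline{D} \cong \underline{\mathfrak{V}}(A(\underline{D}_\sqcap), C(\underline{D}_\sqcup), \sqsubseteq)$, i.e., the preconcept 
algebras are up to isomorphism the complete totally contextual generalized double 
Boolean algebras $\underline{D}$ whose Boolean algebras $\underline{D}_\sqcap$ and $\underline{D}_\sqcup$ are atomic. 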

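Proof Using Proposition 2, it is straightforward to check that every preconcept 
algebra $\underline{\mathfrak{V}}(\mathbb{K})$ is a complete contextual generalized double Boolean algebra whose 
Boolean algebras $\underline{\mathfrak{H}}_\sqcap(\mathbb{K})$ and $\underline{\mathfrak{H}}_\sqcup(\mathbb{K})$ are atomic. For semiconcepts $(A, A')$ and 
$(B', B)$ with $(A'', A') = (A, A')_\sqcup \sqsubseteq (B', B)_\sqcap = (B', B'')$, $(A, B)$ is the unique 
preconcept with $(A, B)_\sqcap = (A, A')$ and $(A, B)_\sqcup = (B', B)$; hence $\underline{\mathfrak{V}}(\mathbb{K})$ is even 
totally contextual. Because of $(A, B)_\sqcap = (A, A')$ and $(A, B)_\sqcup = (B', B)$ for each 
preconcept $(A, B)$, we obtain 

$\sqcap_{t \in T}(A_t, B_t) = \inf_{\underline{\mathfrak{H}}_\sqcap(\mathbb{K})}\{(A_t, A_t') \mid t \in T\} = (\bigcap_{t \in T} A_t, (\bigcap_{t \in T} A_t)')$, 

$\sqcup_{t \in T}(A_t, B_t) = \sup_{\underline{\mathfrak{H}}_\sqcup(\mathbb{K})}\{(B_t', B_t) \mid t \in T\} = ((\bigcap_{t \in T} B_t)', \bigcap_{t \in T} B_t)$. 

Now, let $\varphi : \underline{\mathfrak{V}}(\mathbb{K}) \to \underline{D}$ be an isomorphism. Then the desired bijections are 
defined by $\gamma(g) := \varphi(\{g\}, \{g\}')$ and $\mu(m) := \varphi(\{m\}', \{m\})$. Since $A(\underline{\mathfrak{H}}_\sqcap(\mathbb{K})) = 
\{(\{g\}, \{g\}') \mid g \in G\}$ and $C(\underline{\mathfrak{H}}_\sqcup(\mathbb{K})) = \{(\{m\}', \{m\}) \mid m \in M\}$, it follows 
$A(\underline{D}_\sqcap) = \{\gamma(g) \mid g \in G\}$ and $C(\underline{D}_\sqcup) = \{\mu(m) \mid m \in M\}$. Thus, 
$\gamma$ is indeed a bijection from $G$ onto $A(\underline{D}_\sqcap)$ and $\mu$ a bijection from $M$ onto 
$C(\underline{D}_\sqcup)$. Furthermore, $gIm \Longleftrightarrow (g \in \{m\}'$ and $m \in \{g\}') \Longleftrightarrow (\{g\}, \{g\}') \sqsubseteq 
(\{m\}', \{m\}) \Longleftrightarrow \gamma(g) \sqsubseteq \mu(m)$ for all $g \in G$ and $m \in M$. 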

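Conversely, we assume the existence of the bijections $\gamma$ and $\mu$ with the 
required properties. Then we define two maps $\varphi_\sqcap : \mathfrak{H}_\sqcap(\mathbb{K}) \to D_\sqcap$ and $\varphi_\sqcup : 
\mathfrak{H}_\sqcup(\mathbb{K}) \to D_\sqcup$ by $\varphi_\sqcap(A, A') := \underline{\sqcup}\{\gamma(g) \mid g \in A\}$ and $\varphi_\sqcup(B', B) := \overline{\sqcap}\{\mu(m) \mid 
m \in B\}$. Since $\underline{D}_\sqcap$ and $\underline{D}_\sqcup$ are complete atomic Boolean algebras, $\varphi_\sqcap$ and $\varphi_\sqcup$ 
are isomorphisms onto those Boolean algebras. For an arbitrary $\sqcap$-semiconcept 
$(A, A')$ of $\mathbb{K}$, let $x := \varphi_\sqcap(A, A')$ and $y := \varphi_\sqcup(A'', A')$. Condition 5 (listed above) 
yields that $x_\sqcup = \overline{\sqcap}\{c \in D_\sqcup \mid x \sqsubseteq c\} = \overline{\sqcap}\{c \in C(\underline{D}_\sqcup) \mid x \sqsubseteq c\} = y$ because 
of the equivalence $gIm \Longleftrightarrow \gamma(g) \sqsubseteq \mu(m)$. Thus, 

$(\varphi_\sqcap(A, A'))_\sqcup = \varphi_\sqcup((A, A')_\sqcup)$ and $(\varphi_\sqcup(B', B))_\sqcap = \varphi_\sqcap((B', B)_\sqcap)$ 

because, for a $\sqcup$-semiconcept $(B', B)$ of $\mathbb{K}$, we obtain dually $y_\sqcap = x$ if $y := 
\varphi_\sqcup(B', B)$ and $x := \varphi_\sqcap(B', B'')$. If $(A, B)$ is even a formal concept of $\mathbb{K}$, we have 
$x = y_\sqcap = x_{\sqcup\sqcap} = x_{\sqcap\sqcup\sqcap} = x_{\sqcap\sqcup} \in D_\sqcap \cap D_\sqcup$ by the equation 12a) in Proposition 
2 (in the analogous proof for semiconcept algebras in [VW03] we could just 
directly apply the equation 12). Now it follows that $\varphi_\sqcap(A, B) = x = x_\sqcup = y = 
\varphi_\sqcup(A, B)$. Thus, $\varphi_\sqcap$ and $\varphi_\sqcup$ coincide on $\mathfrak{B}(\mathbb{K})$ $(= \mathfrak{H}_\sqcap(\mathbb{K}) \cap \mathfrak{H}_\sqcup(\mathbb{K}))$ and therefore 
$\varphi(A, A') := \varphi_\sqcap(A, A')$ for $A \subseteq G$ and $\varphi(B', B) := \varphi_\sqcup(B', B)$ for $B \subseteq M$ defines 
a bijection $\varphi$ from $\mathfrak{H}(\mathbb{K})$ onto $D_\sqcap \cup D_\sqcup$. 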

Since $\underline{D}$ is totally contextual, $\varphi$ can be uniquely extended to a bijection $\hat\varphi$ 
from $\mathfrak{V}(\mathbb{K})$ onto $D$: For a preconcept $(A, B)$ which is not a semiconcept and for 
$x := \varphi(A, A')$ and $y := \varphi(B', B)$, we obtain $x_\sqcup \sqsubseteq y_\sqcap$; hence there is a unique 
$z(A, B) \in D$ with $z(A, B)_\sqcap = x$ and $z(A, B)_\sqcup = y$. Thus, we can indeed extend 
$\varphi$ to a bijection $\hat\varphi : \mathfrak{V}(\mathbb{K}) \to D$ with $\hat\varphi(A, B) = z(A, B)$ for all preconcepts 
$(A, B)$ which are not semiconcepts. 

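$\hat\varphi$ preserves the operations of generalized double Boolean algebras which can 
be seen as follows: Since $\varphi_\sqcap$ and $\varphi_\sqcup$ are isomorphisms between the corresponding 
Boolean algebras and since $(\hat\varphi(A, B))_\sqcap = \varphi_\sqcap(A, A') = \hat\varphi((A, B)_\sqcap)$, we get 

$\hat\varphi(\sqcap_{t \in T}(A_t, B_t)) = \varphi_\sqcap(\sqcap_{t \in T}(A_t, B_t)_\sqcap) = \sqcap_{t \in T}\varphi_\sqcap((A_t, B_t)_\sqcap) 
= \sqcap_{t \in T}(\hat\varphi(A_t, B_t))_\sqcap = \sqcap_{t \in T}\hat\varphi(A_t, B_t)$, 

$\hat\varphi(\neg(A, B)) = \varphi_\sqcap(\neg((A, B)_\sqcap)) = \neg(\varphi_\sqcap((A, B)_\sqcap)) 
= \neg((\hat\varphi(A, B))_\sqcap) = \neg(\hat\varphi(A, B))$, 

$\hat\varphi(\emptyset, M) = \varphi_\sqcap(\emptyset, M) = \bot$. 

Dually, we obtain that $\hat\varphi$ preserves joins $\sqcup$, opposition $\lrcorner$, and the top element $\top$. 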

Finally, let $\underline{D}$ be a complete totally contextual generalized double Boolean 
algebra whose Boolean algebras are atomic, let $\gamma$ be the identity on $A(\underline{D}_\sqcap)$, and 
let $\mu$ be the identity on $C(\underline{D}_\sqcup)$. Then the already proved second part of the basic 
theorem yields directly the claimed isomorphy $\underline{D} \cong \underline{\mathfrak{V}}(A(\underline{D}_\sqcap), C(\underline{D}_\sqcup), \sqsubseteq)$. □ 

The Basic Theorem on Preconcept Algebras may be used to check the 
correctness of a line diagram representation of a preconcept algebra. How this is 
performed shall be demonstrated by the formal context $\mathbb{K}_f := (G_f, M_f, I_f)$ in 
Fig. 1 and its corresponding line diagram in Fig. 2. In this figure, 
- the preconcept algebra $\underline{\mathfrak{V}}(\mathbb{K}_f)$ is first of all drawn as an ordered set $(V, \sqsubseteq)$ 
with 0 and 1; 

- the 16 circles with black lower half represent a Boolean algebra $(V_\sqcap, \sqsubseteq)$ which 
is isomorphic to $\underline{\mathfrak{H}}_\sqcap(\mathbb{K}_f)$; 

- the 16 circles with black upper half represent a Boolean algebra $(V_\sqcup, \sqsubseteq)$ 
which is isomorphic to $\underline{\mathfrak{H}}_\sqcup(\mathbb{K}_f)$; 

- the conditions 4., 5., 6., and 7. of the above order-theoretic characterization 
of generalized double Boolean algebras are satisfied; 

- the bijection $\gamma$ is indicated by the attachment of the object names to the 
represented atoms of the Boolean algebra $\underline{\mathfrak{H}}_\sqcap(\mathbb{K}_f)$; 

- the bijection $\mu$ is indicated by the attachment of the attribute names to the 
represented coatoms of the Boolean algebra $\underline{\mathfrak{H}}_\sqcup(\mathbb{K}_f)$; 

- finally, there is an ascending path of line segments from a circle with an 
object label to a circle with an attribute label if and only if the object has 
the attribute according to the formal context $\mathbb{K}_f$. 

After checking all of this, we know by the Basic Theorem on Preconcept 
Algebras that the labelled line diagram in Fig. 2 correctly represents the preconcept 
algebra of the formal context given in Fig. 1. An alternative for checking 
the correctness of a line diagram representation of a preconcept algebra can 
be based on Proposition 1 and the Basic Theorem on Concept Lattices (see 
[GW99a], p. 20). 



3 The Equational Class Generated by all Preconcept Algebras 



An important question is whether the equations of Proposition 2 are enough 
for determining the equational class generated by all preconcept algebras, i.e. , 
whether each equation valid in all preconcept algebras can be entailed by the 
equations of Proposition 2. This can indeed be proved, but it needs further 
investigations of generalized double Boolean algebras. The following four lemmas 
are adapted from the analogous investigations of double Boolean algebras in 
[Wi00a]. 

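Lemma 1 In a weak double Boolean algebra the following conditions are 
satisfied: 

(1) $x \sqcap y \sqsubseteq x \sqsubseteq x \sqcup y$, 

(2) the mapping $x \mapsto x \sqcap y$ preserves $\sqsubseteq$ and $\sqcap$, 

(3) the mapping $x \mapsto x \sqcup y$ preserves $\sqsubseteq$ and $\sqcup$. 

Proof (1): 2a), 3a), and 1a) yield $(x \sqcap y) \sqcap x = y \sqcap (x \sqcap x) = y \sqcap (y \sqcap (x \sqcap x)) = 
(x \sqcap y) \sqcap (x \sqcap y)$. By 2b) and 4b) it follows $(x \sqcap y) \sqcup x = x \sqcup x$. Hence $x \sqcap y \sqsubseteq x$. Dually, 
we obtain $x \sqsubseteq x \sqcup y$. (2) and (3) follow straightforwardly. Let us only mention 
that $x \sqsubseteq y$ implies $(x \sqcap z) \sqcup (y \sqcap z) = (x \sqcap x \sqcap z) \sqcup (y \sqcap z) = (x \sqcap y \sqcap z) \sqcup (y \sqcap z) = 
(y \sqcap z) \sqcup (y \sqcap z)$ by 4b). □ 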
For proving that each generalized double Boolean algebra can be 
quasi-embedded into the preconcept algebra of a suitable context, the notion of a 
filter and an ideal of a weak double Boolean algebra $\underline{D}$ is defined: A subset $F$ of 
$D$ is called a filter if $x \in F$ and $x \sqsubseteq y$ in $D$ imply $y \in F$ and if $x \in F$ and $y \in F$ 
imply $x \sqcap y \in F$; an ideal of $\underline{D}$ is dually defined. A subset $F_0$ is called a base of 
the filter $F$ if $F = \{y \in D \mid x \sqsubseteq y$ for some $x \in F_0\}$; again, a base of an ideal is 
dually defined. 

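Lemma 2 Let $F$ be a filter of a weak double Boolean algebra $\underline{D}$. 

(1) $F \cap D_\sqcap$ and $F \cap D_\sqcup$ are filters of the Boolean algebras $\underline{D}_\sqcap$ and $\underline{D}_\sqcup$, 
respectively. 

(2) Each filter of the Boolean algebra $\underline{D}_\sqcap$ is a base of some filter of $\underline{D}$; 
in particular, $F \cap D_\sqcap$ is a base of $F$. 

Proof Since the restrictions of $\sqcap$ and $\sqsubseteq$ to $D_\sqcap$ are the meet operation and the 
order relation of the Boolean algebra $\underline{D}_\sqcap$, $F \cap D_\sqcap$ is obviously a filter of $\underline{D}_\sqcap$. 
$F \cap D_\sqcup$ is a filter of the Boolean algebra $\underline{D}_\sqcup$ because $x \sqcap y \sqsubseteq x \mathbin{\overline{\sqcap}} y$ for arbitrary 
$x, y \in D_\sqcup$, namely $x \sqcap y$ is a lower bound of $x$ and $y$ by Lemma 1(1) and $x \mathbin{\overline{\sqcap}} y$ 
is the greatest lower bound of $x$ and $y$ since the restriction of $\sqsubseteq$ to $D_\sqcup$ is the 
order of the Boolean algebra $\underline{D}_\sqcup$. Now, let $E$ be a filter of $\underline{D}_\sqcap$. For $x_1 \sqsubseteq y_1$ and 
$x_2 \sqsubseteq y_2$ with $x_1, x_2 \in E$, we obtain $x_1 \sqcap x_2 \sqsubseteq x_1 \sqcap y_2 \sqsubseteq y_1 \sqcap y_2$ by Lemma 1(2); 
hence $\{y \in D \mid x \sqsubseteq y$ for some $x \in E\}$ is a filter in $\underline{D}$ with $E$ as a base. For 
$y \in F$ we have that $y \sqcap y \in F \cap D_\sqcap$ and $y \sqcap y \sqsubseteq y$ by Lemma 1(1). Thus, $F \cap D_\sqcap$ 
is a base of $F$. □ 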

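Let $\mathfrak{F}_p(\underline{D})$ be the set of all filters $F$ of the weak double Boolean algebra $\underline{D}$ for 
which $F \cap D_\sqcap$ is a prime filter of the Boolean algebra $\underline{D}_\sqcap$, and let $\mathfrak{I}_p(\underline{D})$ be the 
set of all ideals $I$ of $\underline{D}$ for which $I \cap D_\sqcup$ is a prime ideal of the Boolean algebra 
$\underline{D}_\sqcup$ (for the definitions and results concerning prime filters and prime ideals see 
[DP92], p. 185ff.). Now, we define the standard context for a weak double Boolean 
algebra $\underline{D}$ by 

$\mathbb{K}(\underline{D}) := (\mathfrak{F}_p(\underline{D}), \mathfrak{I}_p(\underline{D}), \Delta)$ 

where $F \mathbin{\Delta} I :\Longleftrightarrow F \cap I \neq \emptyset$ for $F \in \mathfrak{F}_p(\underline{D})$ and $I \in \mathfrak{I}_p(\underline{D})$. For $x \in D$, let 
$\mathfrak{F}_x := \{F \in \mathfrak{F}_p(\underline{D}) \mid x \in F\}$ and let $\mathfrak{I}_x := \{I \in \mathfrak{I}_p(\underline{D}) \mid x \in I\}$. 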

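Lemma 3 The derivations in $\mathbb{K}(\underline{D})$ yield: 

(1) $\mathfrak{F}_x' = \mathfrak{I}_x = \mathfrak{I}_{x_\sqcup}$ for all $x \in D_\sqcap$, 

(2) $\mathfrak{I}_y' = \mathfrak{F}_y = \mathfrak{F}_{y_\sqcap}$ for all $y \in D_\sqcup$. 

Proof (1): Let $x \in D_\sqcap$ and let $I \in \mathfrak{I}_x$. Then $x \in F \cap I$ for all $F \in \mathfrak{F}_x$; hence 
$I \in \mathfrak{F}_x'$. Conversely, let $I \in \mathfrak{F}_x'$. Suppose $x \notin I$. Then $I \cap D_\sqcap$ is an ideal of $\underline{D}_\sqcap$, 
by the dual of Lemma 2(1), and $x \in D_\sqcap \setminus I$. The ideal $I \cap D_\sqcap$ is contained in 
a prime ideal of $\underline{D}_\sqcap$ the complement of which in $D_\sqcap$ is a prime filter $E$ of $\underline{D}_\sqcap$ 
containing $x$. By Lemma 2(2), $F := \{y \in D \mid w \sqsubseteq y$ for some $w \in E\}$ is a filter 
of $\underline{D}$ with $F \cap D_\sqcap = E$; hence $F \in \mathfrak{F}_x$. But $F \cap I = \emptyset$ which contradicts $I \in \mathfrak{F}_x'$. 
Thus, $x \in I$ and so $I \in \mathfrak{I}_x$. This proves $\mathfrak{F}_x' = \mathfrak{I}_x$. The corresponding equality in 
(2) follows dually. □ 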
Lemma 4 The relation $\square$ on a generalized double Boolean algebra $\underline{D}$ defined by 
$x \mathbin{\square} y :\Longleftrightarrow x \sqsubseteq y$ and $y \sqsubseteq x$ ($\Longleftrightarrow x_\sqcap = y_\sqcap$ and $x_\sqcup = y_\sqcup$) has as restriction 
the identity on $D_p := D_\sqcap \cup D_\sqcup$ and is a congruence relation of $\underline{D}$. 

Proof Let $(x, y) \in \square$. If $x, y \in D_\sqcap$ or $x, y \in D_\sqcup$ then, obviously, $x = y$. Now, 
let $x \in D_\sqcap$ and $y \in D_\sqcup$. Then $x = y_\sqcap = x_{\sqcup\sqcap} = y_{\sqcap\sqcup\sqcap} = y_{\sqcap\sqcup} = x_\sqcup = y$; hence 
$x = y$. Thus $\square \cap (D_p)^2 = \mathrm{Id}_{D_p}$. Clearly, $\square$ is an equivalence relation on $D$ 
which is even a congruence relation because the relationships $a_1 \mathbin{\square} b_1, \ldots, a_n \mathbin{\square} b_n$ 
always imply $(a_i)_\sqcap = (b_i)_\sqcap$ and $(a_i)_\sqcup = (b_i)_\sqcup$ for $i = 1, \ldots, n$ and therefore 
$t(a_1, \ldots, a_n) = t(b_1, \ldots, b_n)$ for each proper algebraic term $t(x_1, \ldots, x_n)$. □ 

For formulating the next two lemmas, which are also adapted from [Wi00a], we need the 
following notion: a map $\alpha$ between quasi-ordered sets is called quasi-injective if 
it satisfies the equivalence $x \sqsubseteq y \Longleftrightarrow \alpha(x) \sqsubseteq \alpha(y)$. 

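Lemma 5 Let $\underline{D}$ be a generalized double Boolean algebra. Then $x \mapsto (\mathfrak{F}_x, \mathfrak{I}_x)$ 
describes a quasi-injective homomorphism $\iota$ from $\underline{D}$ to $\underline{\mathfrak{V}}(\mathbb{K}(\underline{D}))$ having $\square$ as 
kernel. 

Proof Let $x \in D$ and let $I \in \mathfrak{I}_x$. Since $x \in F \cap I$ for all $F \in \mathfrak{F}_x$, we obtain 
$I \in \mathfrak{F}_x'$, i.e. $\mathfrak{I}_x \subseteq \mathfrak{F}_x'$. Thus, $(\mathfrak{F}_x, \mathfrak{I}_x)$ is a preconcept of $\mathbb{K}(\underline{D})$ for all $x \in D$. For 
$x_\sqcap \not\sqsubseteq y_\sqcap$ in $\underline{D}_\sqcap$ there exists always an $F \in \mathfrak{F}_p(\underline{D})$ with $x_\sqcap \in F$ but $y_\sqcap \notin F$; 
hence $\mathfrak{F}_x = \mathfrak{F}_{x_\sqcap} \not\subseteq \mathfrak{F}_{y_\sqcap} = \mathfrak{F}_y$ and so $(\mathfrak{F}_x, \mathfrak{I}_x) \not\sqsubseteq (\mathfrak{F}_y, \mathfrak{I}_y)$. Such inequality can be 
analogously obtained for $y_\sqcap \not\sqsubseteq x_\sqcap$, $x_\sqcup \not\sqsubseteq y_\sqcup$, and $y_\sqcup \not\sqsubseteq x_\sqcup$. If $x_\sqcap \sqsubseteq y_\sqcap$, $y_\sqcap \sqsubseteq x_\sqcap$ 
and $x_\sqcup \sqsubseteq y_\sqcup$, $y_\sqcup \sqsubseteq x_\sqcup$, then $x_\sqcap = y_\sqcap$ and $x_\sqcup = y_\sqcup$; hence $(\mathfrak{F}_x, \mathfrak{I}_x) = (\mathfrak{F}_y, \mathfrak{I}_y)$. 
Therefore $x \mapsto (\mathfrak{F}_x, \mathfrak{I}_x)$ describes a quasi-injective map $\iota$ from $\underline{D}$ to $\underline{\mathfrak{V}}(\mathbb{K}(\underline{D}))$ 
having $\square$ as kernel. It is even a homomorphism because, besides $\mathfrak{F}_\top = \mathfrak{F}_p(\underline{D})$ and 
$\mathfrak{I}_\bot = \mathfrak{I}_p(\underline{D})$, we can show that $\mathfrak{F}_{x \sqcap y} = \mathfrak{F}_x \cap \mathfrak{F}_y$, $\mathfrak{I}_{x \sqcup y} = \mathfrak{I}_x \cap \mathfrak{I}_y$, $\mathfrak{F}_{\neg x} = \mathfrak{F}_p(\underline{D}) \setminus \mathfrak{F}_x$, 
and $\mathfrak{I}_{\lrcorner x} = \mathfrak{I}_p(\underline{D}) \setminus \mathfrak{I}_x$. These equalities result from Lemma 3 and the following 
equivalences and their duals: $F \in \mathfrak{F}_{x \sqcap y} \Longleftrightarrow x \sqcap y \in F \Longleftrightarrow x, y \in F \Longleftrightarrow F \in \mathfrak{F}_x \cap \mathfrak{F}_y$ 
and $F \in \mathfrak{F}_{\neg x} \Longleftrightarrow \neg x \in F \Longleftrightarrow \neg(x_\sqcap) \in F \Longleftrightarrow x_\sqcap \notin F \Longleftrightarrow x \notin F \Longleftrightarrow F \in \mathfrak{F}_p(\underline{D}) \setminus \mathfrak{F}_x$. 

□ 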

Lemma 6 Let $\underline{D}$ be a generalized double Boolean algebra. Then each map $\beta : 
\iota(D) \to D$ satisfying $\iota\beta(y) = y$ is an injective homomorphism from the image of 
$\iota$ into $\underline{D}$. 

Proof Let $\beta$ be a map from $\iota(D)$ into $D$ satisfying $\iota\beta(y) = y$. Obviously, $\beta$ is 
injective and even bijective on $\iota(D_p)$. Since the operations of $\underline{D}$ always result 
in elements of $D_p$, we obtain $\beta\iota(x) \sqcap \beta\iota(y) = \beta\iota(\beta\iota(x) \sqcap \beta\iota(y)) = \beta(\iota\beta\iota(x) \sqcap 
\iota\beta\iota(y)) = \beta(\iota(x) \sqcap \iota(y))$ and, analogously, the compatibility conditions for the 
other operations; hence $\beta$ is an injective homomorphism. □ 

Now, we are prepared to prove that each equation valid in all preconcept 
algebras is a logical consequence of the equations in Proposition 2. This claim is 
the content of the following theorem: 

Theorem 2 The equational class generated by all preconcept algebras of formal 
contexts is the class of all generalized double Boolean algebras. 
Proof Let $\underline{D}$ be a generalized double Boolean algebra. For a non-trivial equation 
$t_1(x_1, \ldots, x_n) = t_2(x_1, \ldots, x_n)$ valid in all preconcept algebras, Lemma 5 yields 
$t_1(\iota(a_1), \ldots, \iota(a_n)) = t_2(\iota(a_1), \ldots, \iota(a_n))$ for $a_1, \ldots, a_n \in D$ and, by Lemma 6, 
even $t_1(a_1, \ldots, a_n) = t_1(\beta\iota(a_1), \ldots, \beta\iota(a_n)) = t_2(\beta\iota(a_1), \ldots, \beta\iota(a_n)) = t_2(a_1, 
\ldots, a_n)$ for each map $\beta$ from $\iota(D)$ to $D$ with $\iota\beta(y) = y$ and $\beta\iota(a_i) = a_i$ for 
$i = 1, \ldots, n$. With such a map $\beta$, we obtain that $a_i \mathbin{\square} a_j$ implies $a_i = \beta\iota(a_i) = 
\beta\iota(a_j) = a_j$. In general, $a_i \mathbin{\square} a_j$ ($i < j$) causes the equality $t_k(a_1, \ldots, a_i, \ldots, a_j, 
\ldots, a_n) = t_k(a_1, \ldots, a_i, \ldots, a_i, \ldots, a_n)$ for $k = 1, 2$. All together show that the 
equation $t_1(x_1, \ldots, x_n) = t_2(x_1, \ldots, x_n)$ is valid in all generalized double Boolean 
algebras too. □ 

For a finite generalized double Boolean algebra $\underline{D}$, the elements of $\mathfrak{F}_p(\underline{D})$ 
are just the principal filters $[a) := \{x \in D \mid a \sqsubseteq x\}$ where $a$ is an atom of the 
Boolean algebra $\underline{D}_\sqcap$, and the elements of $\mathfrak{I}_p(\underline{D})$ are just the principal ideals $(c] := 
\{x \in D \mid x \sqsubseteq c\}$ where $c$ is a coatom of the Boolean algebra $\underline{D}_\sqcup$; furthermore, 
$[a) \mathbin{\Delta} (c] \Longleftrightarrow a \sqsubseteq c$. Therefore, the formal context $(A(\underline{D}_\sqcap), C(\underline{D}_\sqcup), \sqsubseteq)$, for which 
$A(\underline{D}_\sqcap)$ is the set of all atoms of $\underline{D}_\sqcap$ and $C(\underline{D}_\sqcup)$ is the set of all coatoms of $\underline{D}_\sqcup$, 
can be viewed as a simplified version of the standard context of $\underline{D}$. Substituting 
this simplified version in Lemma 5 yields the following corollary: 

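Corollary 1 Let $\underline{D}$ be a finite generalized double Boolean algebra. Then
$x \mapsto (\{a \in A(\underline{D}_\sqcap) \mid a \sqsubseteq x\}, \{c \in C(\underline{D}_\sqcup) \mid x \sqsubseteq c\})$ describes a quasi-injective
homomorphism $\iota$ from $\underline{D}$ to $\underline{\mathfrak{V}}(A(\underline{D}_\sqcap), C(\underline{D}_\sqcup), \sqsubseteq)$ which maps $D_p$ isomorphically
onto $\mathfrak{H}(A(\underline{D}_\sqcap), C(\underline{D}_\sqcup), \sqsubseteq)$.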
Abstract. Protoconcept graphs are part of Contextual Judgment Logic. 
Generalizing the well-developed theory of concept graphs, they express 
judgments with a negation on the level of concepts and relations by rep- 
resenting information given in a power context family in a rhetorically 
structured way. The conceptual content of a protoconcept graph is un- 
derstood as the information which is represented in the graph directly, 
enlarged by the information deducible from it by protoconcept implications
of the power context family. The main result of this paper is that
conceptual contents of protoconcept graphs of a given power context
family can be derived as extents of the so-called conceptual information
context of the power context family. This generalizes the Basic
Theorem on $\vec{\mathbb{K}}$-Conceptual Contents in [Wi03].



1 Introduction 

The theory of protoconcept graphs is part of a program called ‘Contextual Logic’ 
which can be understood as a formalization of the traditional philosophical logic 
with its doctrine of concepts, judgments, and conclusions (cf. [Wi00b], [DK03]). 
Concepts, as basic units of thinking, are already formalized in Formal Concept 
Analysis (FCA) and its extensions ([GW99], [Wi02], [VW03]). Judgments are 
then understood as meaningful combinations of concepts, and with conclusions 
new judgments are inferred from already existing ones. 

A main goal of FCA from its very beginning has been the support of rational 
communication. This claim has been met by using diagrammatic representations 
to make complex data available: Information given in a formal context is repre- 
sented by a labelled line diagram, which allows even the unfamiliar user to un- 
derstand and analyze the inner structure of the data. Similarly, for the doctrine 
of judgments, a formalization would be desirable which is both mathematically 
precise and representable by diagrams which are easily readable. Moreover, in 
order to make more complex information intelligible, it would be helpful if 
the diagrams could be structured rhetorically. 

A promising approach is to use the system of conceptual graphs by John Sowa 
(see [So84], [So92]). These graphs are a diagrammatic system of logic whose
purpose is ‘to express meaning in a form that is logically precise, humanly readable,
and computationally tractable’ (see [So92]). The philosophical background of 

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 14-27, 2004. 
this system is similar to FCA, and the system allows us to formulate judgments 
and conclusions in a way which is much nearer to human reasoning than predi- 
cate logic (for a more thorough discussion of the intentions and the philosophical 
background of FCA and Conceptual Graphs we refer to [DK03]). However, the 
system of conceptual graphs is not elaborated mathematically. One reason for 
this is that conceptual graphs were designed to be of use in a wide variety of 
different fields (the most prominent being knowledge representation, software 
specification and modelling, information extraction and natural language
processing), which led to a broad range of modifications and extensions of the 
core theory. The system of conceptual graphs as a whole is huge, without sharp 
borders and contains several ambiguities. Thus, when making conceptual graphs 
mathematically explicit, only restricted parts of Sowa’s theory can be covered 
at once. 

The first approach to a mathematization was discussed in [Wi97], where
so-called concept graphs were introduced semantically. They are defined with
respect to a given family of formal contexts (called a power context family) and
contain information ranging over several of these contexts. Thus, a concept graph
represents information from the power context family, but it is supplied with
a rhetoric structure and the information is not separated with respect to the
individual contexts. In Figure 1 we see an example of a concept graph, albeit
without the power context family it refers to. The power context family is omitted
because the purpose of the example is to visualize how rhetoric structure can be
represented in a graph. The concept graph can be read rather straightforwardly:
‘Tom is a cat, Jerry is a mouse and Tom chases Jerry’ (or, in short, ‘The cat
Tom chases the mouse Jerry’). 





[Figure: the vertex ‘CAT: Tom’ is linked as argument 1 and the vertex
‘MOUSE: Jerry’ as argument 2 to the relation ‘chase’.]

Fig. 1. A concept graph 



Although it is possible to define a single graph which captures all the
information given in a power context family (cf. [Da03a]), usually only those parts
of the information that serve a given purpose are represented in a graph. While the
theory of concept graphs is well developed and includes many of Sowa’s extensions (see
[Pr98], [Wi02], [SW03]), it lacks a negation. Such a negation can be considered
on different levels: In this paper we will consider so-called protoconcept graphs
which feature a (limited) negation on the level of concepts and relations; on the
level of judgments negation has been introduced and discussed extensively in
[Da03b]. This approach, however, leads to the classical mathematical negation
and requires a separation of syntax and semantics. Since, in the present paper,
we are interested in a semantic approach, Dau’s graphs are not considered here. 

It is necessary to discuss how the information given in a (proto)concept 
graph might be represented mathematically. Since concept graphs are always 
defined with respect to a given power context family, the background knowledge
is assumed to be codable in formal contexts. But why do formal contexts provide
a suitable way of coding background knowledge without being too restrictive? One
of the results of [Wi03] was that if the background knowledge consists of object-
and concept implications as described by Brandom in his theory of discursive
practice (cf. [Br94]), then it can be represented appropriately by formal contexts.
Hence, in this paper we assume that our background knowledge is given via a
power context family and refer the interested reader to [Wi03]. 

Now we can describe which information is transmitted by a concept graph. 
Obviously, the graph conveys every information unit which is expressed explic- 
itly. For example, the sample concept graph in Figure 1 obviously contains the 
information that Tom is a cat. However, a (proto)concept graph may convey
information which is not represented in it directly, but results from the
activation of background knowledge. For instance, for the concept graph represented
in Figure 1, the information that Tom is a cat may, depending on the
preknowledge coded in the power context family, also transmit that Tom is an animal.
The sum of all these (directly and indirectly) represented information units then
constitutes the so-called conceptual content of the concept graph. 

As argued above, the formalizations of the three doctrines are inter-related. 
In [Wi03], it was even shown that concept graphs as (affirmative) judgments can 
be made accessible through FCA. This was done by proving that the conceptual 
contents of concept graphs are formally derivable as concept extents. This paper 
aims at generalizing the result of [Wi03] to protoconcept graphs (i.e. judgments 
with a negation on the level of concepts and relations), thereby strengthening 
the inter-relationship of FCA and Boolean Judgment Logic (see [Wi01]). In
particular, we will find that although protoconcept graphs are based on the more 
general structure of protoconcept algebras instead of concept lattices, it is never- 
theless possible to express the conceptual content as extent of a suitably chosen 
context. Moreover, a closer study of the closure system of conceptual contents 
reveals a rather simple structure, which may simplify the problem of drawing 
these lattices. 

This paper consists of three more sections. In Section 2, the basic definitions 
are introduced and explained by an example. In the third section, it is shown 
that the conceptual contents of protoconcept graphs of a given power context 
family (see [Wi02]) can be described as extents of a formal context. This is done 
in two steps: First, power context families consisting of a single context only are 
considered, then the result is extended to power context families of limited type 
in general. Section 4 contains some suggestions for further research. 

2 Basic Definitions 

In this section, we focus on reviewing several definitions. We assume that the 
reader is familiar with all basic notions of FCA (for an extensive introduction see 
[GW99]). First we recall several definitions from [Wi02]: Let $\mathbb{K} := (G, M, I)$ be
a formal context. A protoconcept of $\mathbb{K}$ is a pair $(A, B)$ with $A \subseteq G$ and $B \subseteq M$
such that $A^{I} = B^{II}$ (which is equivalent to $A^{II} = B^{I}$). The set $\mathfrak{P}(\mathbb{K})$ of all
protoconcepts of $\mathbb{K}$ is structured by the generalization order $\sqsubseteq$, defined by

\[ (A_1, B_1) \sqsubseteq (A_2, B_2) \;:\Longleftrightarrow\; A_1 \subseteq A_2 \text{ and } B_1 \supseteq B_2, \]

and by operations defined as follows:

\[
\begin{aligned}
(A_1, B_1) \sqcap (A_2, B_2) &:= (A_1 \cap A_2, (A_1 \cap A_2)^{I}) \\
(A_1, B_1) \sqcup (A_2, B_2) &:= ((B_1 \cap B_2)^{I}, B_1 \cap B_2) \\
\neg (A, B) &:= (G \setminus A, (G \setminus A)^{I}) \\
\lrcorner (A, B) &:= ((M \setminus B)^{I}, M \setminus B) \\
\top &:= (G, \emptyset) \\
\bot &:= (\emptyset, M).
\end{aligned}
\]

The set $\mathfrak{P}(\mathbb{K})$ together with the operations $\sqcap, \sqcup, \neg, \lrcorner, \top$ and $\bot$ is called the
algebra of protoconcepts of $\mathbb{K}$ and denoted by $\underline{\mathfrak{P}}(\mathbb{K})$. The operations are called
meet, join, negation, opposition, all and nothing. Figure 2 shows an example of
a formal context with its protoconcept algebra. The elements except for $\top$ and
$\bot$ are numbered in order to make them more easily accessible for reference. 
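
The definitions above can be sketched in a few lines of Python. The toy context below (two objects sharing a single attribute) is an assumption made for illustration and is not the context of Figure 2; all function names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy context: both objects share the single attribute.
G = frozenset({"Tom", "Jerry"})
M = frozenset({"animated figure"})
I = {(g, m) for g in G for m in M}

def powerset(s):
    """All subsets of s as frozensets."""
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def up(A):
    """Derivation A^I: attributes common to all objects in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B):
    """Derivation B^I: objects having all attributes in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

def is_protoconcept(A, B):
    """(A, B) is a protoconcept iff A^I = B^II."""
    return up(A) == up(down(B))

def leq(p, q):
    """Generalization order: smaller extent and larger intent."""
    return p[0] <= q[0] and p[1] >= q[1]

def meet(p, q):
    A = p[0] & q[0]
    return (A, up(A))

def join(p, q):
    B = p[1] & q[1]
    return (down(B), B)

def neg(p):
    A = G - p[0]
    return (A, up(A))

def opp(p):
    B = M - p[1]
    return (down(B), B)

TOP = (G, frozenset())      # all
BOTTOM = (frozenset(), M)   # nothing

protoconcepts = [(A, B) for A in powerset(G) for B in powerset(M)
                 if is_protoconcept(A, B)]
```

In this toy context every pair $(A, B)$ happens to be a protoconcept. Note that `meet(TOP, TOP)` differs from `TOP`: the meet is not idempotent, which is why $p \sqcap p$ appears explicitly later on.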

Moreover, we define a semiconcept of $\mathbb{K}$ as a pair $(A, B)$ with $A \subseteq G$ and
$B \subseteq M$ such that $A^{I} = B$ or $B^{I} = A$ and define $\mathfrak{H}(\mathbb{K})$ to be the set of all
semiconcepts of $\mathbb{K}$. Obviously, each semiconcept is a protoconcept. In particular, we
define the sets $\mathfrak{H}_\sqcap(\mathbb{K}) := \{(A, A^{I}) \mid A \subseteq G\}$ and $\mathfrak{H}_\sqcup(\mathbb{K}) := \{(B^{I}, B) \mid B \subseteq M\}$
of $\sqcap$-semiconcepts and $\sqcup$-semiconcepts, respectively. The $\sqcap$-semiconcepts of
the algebra in Figure 2 are marked by circles with the lower half filled, those
which are elements of $\mathfrak{H}_\sqcup$ are represented by circles whose upper half is filled.
Those circles which are completely filled are concepts. The concept lattice of
the context always consists of the elements in the intersection of the $\sqcup$- and the
$\sqcap$-semiconcepts. Note that whenever an operation is performed, we obtain a
semiconcept. In particular, the result of the operations $\sqcap, \neg, \bot$ is a $\sqcap$-semiconcept,
and any result of the operations $\sqcup, \lrcorner, \top$ is a $\sqcup$-semiconcept. It is a well-known
fact that $\underline{\mathfrak{H}}_\sqcap(\mathbb{K}) := (\mathfrak{H}_\sqcap(\mathbb{K}), \sqcap, \dot\sqcup, \dot\top, \bot)$ (where $a \mathbin{\dot\sqcup} b := \neg(\neg a \sqcap \neg b)$ and
$\dot\top := \neg\bot$) is isomorphic to the Boolean algebra of all subsets of $G$. Now we define
$\gamma : G \to \mathfrak{H}_\sqcap(\mathbb{K})$, $\gamma(g) = (\{g\}, \{g\}^{I})$ and $\mu : M \to \mathfrak{H}_\sqcup(\mathbb{K})$, $\mu(m) = (\{m\}^{I}, \{m\})$.
The set of $\bigvee$-irreducible elements of that lattice is $\{\gamma(g) \mid g \in G\}$ and the set of
$\bigwedge$-irreducible elements is equal to $\{(G \setminus \{g\}, (G \setminus \{g\})^{I}) \mid g \in G\}$ (for a detailed
discussion see [Wi00a]). The objects and attributes in Figure 2 are attached to
the circles corresponding to their images under $\gamma$ and $\mu$, respectively. 
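
As a small illustration of these notions, the following sketch enumerates the two kinds of semiconcepts and recovers the concepts as their intersection. The toy context is an assumption (it is not the context of Figure 2), and the helper names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy context: both objects share the single attribute.
G = frozenset({"Tom", "Jerry"})
M = frozenset({"animated figure"})
I = {(g, m) for g in G for m in M}

def powerset(s):
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def up(A):
    """A^I: attributes common to all objects in A."""
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B):
    """B^I: objects having all attributes in B."""
    return frozenset(g for g in G if all((g, m) in I for m in B))

# Meet-semiconcepts (A, A^I) and join-semiconcepts (B^I, B).
meet_semiconcepts = {(A, up(A)) for A in powerset(G)}
join_semiconcepts = {(down(B), B) for B in powerset(M)}

# Formal concepts are exactly the pairs that are both.
concepts = meet_semiconcepts & join_semiconcepts

def gamma(g):
    """Object protoconcept ({g}, {g}^I)."""
    return (frozenset({g}), up({g}))

def mu(m):
    """Attribute protoconcept ({m}^I, {m})."""
    return (down({m}), frozenset({m}))
```

Here the meet-semiconcepts are in bijection with the subsets of $G$, in line with the isomorphism to the Boolean algebra of all subsets of $G$ stated above.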

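Definition 1. A power context family $\vec{\mathbb{K}} := (\mathbb{K}_k)_{k = 0, 1, 2, \ldots}$ is a family of contexts
$\mathbb{K}_k := (G_k, M_k, I_k)$ such that $G_k \subseteq (G_0)^k$ for $k \in \mathbb{N}$. The power context family
is said to be of limited type $n \in \mathbb{N}$ if $\vec{\mathbb{K}} := (\mathbb{K}_0, \mathbb{K}_1, \ldots, \mathbb{K}_n)$, otherwise it is
called unlimited. The elements of $\mathfrak{P}(\mathbb{K}_0)$ are called protoconcepts, the elements
of $\mathfrak{R}_{\vec{\mathbb{K}}} := \bigcup_{k \in \mathbb{N}} \mathfrak{P}(\mathbb{K}_k)$ are called relation protoconcepts. 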

As an example of a power context family consider $\vec{\mathbb{K}} := (\mathbb{K}_0, \mathbb{K}_1, \mathbb{K}_2)$ with
the contexts $\mathbb{K}_0$ and $\mathbb{K}_2$ in Figure 3. The context $\mathbb{K}_1$ is the empty context. 

Fig. 2. A formal context and its protoconcept algebra 

Fig. 3. A power context family 



Next, we repeat the definition of protoconcept graphs as introduced in [Wi02]:
The underlying structure is a so-called relational graph, which is a triple $(V, E, \nu)$
consisting of a set $V$ of vertices, a set $E$ of edges and a mapping
$\nu : E \to \bigcup_{k = 1, 2, \ldots} V^k$ which maps each edge to the ordered tuple of its adjacent vertices.
If $\nu(e) = (v_1, \ldots, v_k)$, then we say that the arity of $e$ is $k$. The vertices are said
to have arity 0, i.e. we set $E^{(0)} := V$. Moreover, let $E^{(k)}$ ($k = 0, 1, \ldots$) be the
set of all $k$-ary edges. 

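Definition 2. A protoconcept graph of a power context family $\vec{\mathbb{K}} := (\mathbb{K}_0, \mathbb{K}_1, \ldots)$
with $\mathbb{K}_k := (G_k, M_k, I_k)$ for $k = 0, 1, 2, \ldots$ is a structure $\mathfrak{G} := (V, E, \nu, \kappa, \rho)$ for
which

- $(V, E, \nu)$ is a relational graph,

- $\kappa : V \cup E \to \bigcup_{k = 0, 1, \ldots} \mathfrak{P}(\mathbb{K}_k)$ is a mapping with $\kappa(u) \in \mathfrak{P}(\mathbb{K}_k)$ for all $u \in E^{(k)}$
($k = 0, 1, \ldots$),

- $\rho : V \to \mathcal{P}(G_0) \setminus \{\emptyset\}$ is a mapping with $\rho^{+}(v) := \rho(v) \cap \mathrm{Ext}(\kappa(v))$ and
$\rho^{-}(v) := \rho(v) \setminus \rho^{+}(v)$ satisfying that, for $\nu(e) = (v_1, \ldots, v_k)$, $\rho^{+}(v_j) \neq \emptyset$ for
all $j = 1, \ldots, k$ or $\rho^{-}(v_j) \neq \emptyset$ for all $j = 1, \ldots, k$,
$\rho^{+}(v_1) \times \cdots \times \rho^{+}(v_k) \subseteq \mathrm{Ext}(\kappa(e))$ and
$\rho^{-}(v_1) \times \cdots \times \rho^{-}(v_k) \subseteq (G_0)^k \setminus \mathrm{Ext}(\kappa(e))$. 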
We consider the mapping $\rho$ not only on vertices but also on edges: For $\nu(e) =
(v_1, \ldots, v_k)$, let $\rho(e) := \rho^{+}(e) \cup \rho^{-}(e)$ with $\rho^{+}(e) := \rho^{+}(v_1) \times \cdots \times \rho^{+}(v_k)$ and
$\rho^{-}(e) := \rho^{-}(v_1) \times \cdots \times \rho^{-}(v_k)$. 

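A sample protoconcept graph $\mathfrak{G} := (V, E, \nu, \kappa, \rho)$ is shown in Figure 4. The
relational graph is $(\{v, w\}, \{e\}, \nu)$ with $\nu(e) = (v, w)$; moreover we have $\kappa(v) =
\gamma(\mathrm{Tom})$, $\kappa(w) = \gamma(\mathrm{Jerry})$ and $\kappa(e) = \mu(\mathrm{chases})$ and $\rho(v) = \{\mathrm{Tom}, \mathrm{Jerry}\} = \rho(w)$.
The graph then reads: Tom chases Jerry, Jerry does not chase Tom and Tom
and Jerry are animated figures (since $\gamma(\mathrm{Tom}) = (\{\mathrm{Tom}\}, \{\text{animated figure}\})$
and $\gamma(\mathrm{Jerry}) = (\{\mathrm{Jerry}\}, \{\text{animated figure}\})$). In particular, the graph contains
the conceptual information units $(g, \kappa(u))$ and $(h, \neg\kappa(u))$ with $k = 0, 1, \ldots$, $u \in E^{(k)}$
and $g \in \rho^{+}(u)$, $h \in \rho^{-}(u)$. For the graph in our example, these are the
pairs $(\mathrm{Tom}, \gamma(\mathrm{Tom}))$, $(\mathrm{Jerry}, \neg\gamma(\mathrm{Tom}))$, $(\mathrm{Jerry}, \gamma(\mathrm{Jerry}))$, $(\mathrm{Tom}, \neg\gamma(\mathrm{Jerry}))$ and
$((\mathrm{Tom}, \mathrm{Jerry}), \mu(\mathrm{chases}))$, $((\mathrm{Jerry}, \mathrm{Tom}), \neg\mu(\mathrm{chases}))$. 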
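
The information units of the sample graph can be computed mechanically. The sketch below assumes contexts that are consistent with the description above (Tom and Jerry share the single attribute "animated figure", and only the pair (Tom, Jerry) has the attribute "chases"); the actual contexts of Figure 3 may contain more, and all identifiers are hypothetical.

```python
from itertools import product

# Assumed contexts K0 and K2 (a guess consistent with the running example).
G0 = frozenset({"Tom", "Jerry"})
M0 = frozenset({"animated figure"})
I0 = {(g, m) for g in G0 for m in M0}
G2 = frozenset((g, h) for g in G0 for h in G0)
M2 = frozenset({"chases"})
I2 = {(("Tom", "Jerry"), "chases")}

def up(A, M, I):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B, G, I):
    return frozenset(g for g in G if all((g, m) in I for m in B))

def neg(p, G, M, I):
    """Negation of a protoconcept p = (extent, intent)."""
    A = G - p[0]
    return (A, up(A, M, I))

def gamma(g):
    return (frozenset({g}), up({g}, M0, I0))

def mu(m):
    return (down({m}, G2, I2), frozenset({m}))

# The sample graph: two vertices, one binary edge.
kappa = {"v": gamma("Tom"), "w": gamma("Jerry"), "e": mu("chases")}
rho = {"v": G0, "w": G0}
nu = {"e": ("v", "w")}

def rho_plus(v):
    return rho[v] & kappa[v][0]

def rho_minus(v):
    return rho[v] - rho_plus(v)

# 0-ary information units, attached to vertices.
units0 = set()
for v in rho:
    units0 |= {(g, kappa[v]) for g in rho_plus(v)}
    units0 |= {(h, neg(kappa[v], G0, M0, I0)) for h in rho_minus(v)}

# Binary information units, attached to the edge.
units2 = set()
for e, ends in nu.items():
    units2 |= {(t, kappa[e]) for t in product(*(rho_plus(v) for v in ends))}
    units2 |= {(t, neg(kappa[e], G2, M2, I2))
               for t in product(*(rho_minus(v) for v in ends))}
```

Under these assumptions $\neg\gamma(\mathrm{Tom})$ coincides with $\gamma(\mathrm{Jerry})$, so the four 0-ary units listed above collapse to two distinct pairs.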

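[Figure: vertex $v$ labelled $\gamma(\mathrm{Tom})$: {Tom} | {Jerry} and vertex $w$ labelled
$\gamma(\mathrm{Jerry})$: {Jerry} | {Tom}, joined by the edge $e$ labelled $\mu(\mathrm{chases})$.]

Fig. 4. A protoconcept graph of the power context family 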

Next, we introduce the conceptual content of a protoconcept graph in a way
which is very similar to [Wi03]: First we define what protoconcept implications
in $\mathbb{K}_k$ are and then introduce the conceptual content as the disjoint union of the
closures of the conceptual information units under these implications. 

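Thus, let $\vec{\mathbb{K}} := (\mathbb{K}_0, \ldots, \mathbb{K}_n)$ be a limited power context family with $\mathbb{K}_k :=
(G_k, M_k, I_k)$ ($k = 0, 1, \ldots, n$). For $\mathfrak{P}, \mathfrak{Q} \subseteq \mathfrak{P}(\mathbb{K}_k)$, we say that the context $\mathbb{K}_k$
satisfies $\mathfrak{P} \to \mathfrak{Q}$ if $\bigsqcap \mathfrak{P} \sqsubseteq \bigsqcap \mathfrak{Q}$. In particular, we set $\bigsqcap \{p\} := p \sqcap p$. The formal
inferences $\mathfrak{P} \to \mathfrak{Q}$ give rise to a closure system $\mathcal{C}(\mathbb{K}_k)$ on
$S^{\mathrm{imp}}(\mathbb{K}_k) := \{(g, p) \in G_k \times \mathfrak{P}(\mathbb{K}_k) \mid g \in \mathrm{Ext}(p)\}$ consisting of all $Y \subseteq S^{\mathrm{imp}}(\mathbb{K}_k)$ which satisfy

(P) If $A \times \mathfrak{P} \subseteq Y$ and if $\mathbb{K}_k$ satisfies $\mathfrak{P} \to \mathfrak{Q}$ then $A \times \mathfrak{Q} \subseteq Y$. 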

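Now the $k$-ary conceptual content $C_k(\mathfrak{G})$ for a protoconcept graph $\mathfrak{G} :=
(V, E, \nu, \kappa, \rho)$ of a power context family $\vec{\mathbb{K}}$ is defined as the closure of
$\{(g, \kappa(u)) \mid u \in E^{(k)} \text{ and } g \in \rho^{+}(u)\} \cup \{(g, \neg\kappa(u)) \mid u \in E^{(k)} \text{ and } g \in \rho^{-}(u)\}$ with respect
to the closure system $\mathcal{C}(\mathbb{K}_k)$. Then

\[ C(\mathfrak{G}) := C_0(\mathfrak{G}) \mathbin{\dot\cup} C_1(\mathfrak{G}) \mathbin{\dot\cup} C_2(\mathfrak{G}) \mathbin{\dot\cup} \cdots \]

is called the $\vec{\mathbb{K}}$-conceptual content of the protoconcept graph $\mathfrak{G}$. The 0-ary
conceptual content for the graph in Figure 4 is $S^{\mathrm{imp}}(\mathbb{K}_0)$. Finally, we introduce
the information (quasi-)order between protoconcept graphs by
$\mathfrak{G}_1 \lesssim \mathfrak{G}_2 :\Longleftrightarrow C(\mathfrak{G}_1) \subseteq C(\mathfrak{G}_2)$. We then say that $\mathfrak{G}_1$ is less informative than $\mathfrak{G}_2$. 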

It is important to note that contrary to the approach in [Wi03] we do not
consider so-called object implications in $\mathbb{K}_k$, which were defined as
$A \to B :\Longleftrightarrow A^{I_k} \subseteq B^{I_k}$ for $A, B \subseteq G_k$. For concepts $\mathfrak{b}$, $A^{I_k} \subseteq B^{I_k}$ then implies that
$B \subseteq \mathrm{Ext}(\mathfrak{b})$ whenever $A \subseteq \mathrm{Ext}(\mathfrak{b})$. However, in the case of protoconcept graphs we
find that if two object sets satisfy $A^{I_k} \subseteq B^{I_k}$, then as soon as $A \not\supseteq B$ there
are protoconcepts $p_1, p_2$ with $A = \mathrm{Ext}(p_1) \not\supseteq \mathrm{Ext}(p_2) = B$. In particular, for
$\{g_1\}, \{g_2\}$ with $\{g_1\}^{I_k} \subseteq \{g_2\}^{I_k}$ we obtain that $\{g_1\} = \mathrm{Ext}(\gamma(g_1))$ and
$\{g_2\} = \mathrm{Ext}(\gamma(g_2))$. Thus, the fact that $g_1$ is in the extent of a protoconcept $p$ together
with $\{g_1\}^{I_k} \subseteq \{g_2\}^{I_k}$ does not imply that $g_2$ is in the extent of $p$ as well. 

3 Conceptual Contents as Formal Concepts 

Before elaborating why conceptual contents of protoconcept graphs of arbitrary
power context families can be understood as extents of a formal context, we focus
on the simplest case, i.e., on power context families $\vec{\mathbb{K}} := (\mathbb{K})$. Thus,
conceptual contents of a formal context $\mathbb{K}$ are the conceptual contents of protoconcept
graphs of the power context family $\vec{\mathbb{K}} := (\mathbb{K})$ of limited type 0.

The first step is to represent the conceptual contents of $\mathbb{K}$ in a more
convenient fashion which no longer depends on the protoconcept graphs of a context,
but solely on the context itself: 

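Definition 3. The structure $\underline{S}^{\mathrm{imp}}(\mathbb{K}) := (S^{\mathrm{imp}}(\mathbb{K}), \leq_{S^{\mathrm{imp}}(\mathbb{K})}, \bigsqcap)$ with

(1) $(g, p) \leq_{S^{\mathrm{imp}}(\mathbb{K})} (h, q) :\Longleftrightarrow g = h$ and $q \sqsubseteq p$

(2) $\bigsqcap (A \times \mathfrak{Q}) := \{(g, \bigsqcap_{p \in \mathfrak{Q}} p) \mid g \in A\}$ for $A \times \mathfrak{Q} \subseteq S^{\mathrm{imp}}(\mathbb{K})$

is called the implicational context structure of the context $\mathbb{K}$. The implicational
closure $C^{\mathrm{imp}}(X)$ of $X \subseteq S^{\mathrm{imp}}(\mathbb{K})$ is the smallest order ideal of the structure
$(S^{\mathrm{imp}}(\mathbb{K}), \leq_{S^{\mathrm{imp}}(\mathbb{K})})$ containing $X$ that is closed under the partial multioperation $\bigsqcap$.
Moreover, let $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$ denote the set of all implicational closures of $\underline{S}^{\mathrm{imp}}(\mathbb{K})$. 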
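
For a finite context the implicational closure can be computed object by object: all protoconcepts attached to an object are met, and the principal filter of that meet is attached to the object. The sketch below follows this idea on an assumed toy context (not the context of Figure 3); the function names are hypothetical, and nonempty sets of protoconcepts are assumed when forming meets.

```python
from itertools import combinations
from functools import reduce

# Hypothetical toy context: both objects share the single attribute.
G = frozenset({"Tom", "Jerry"})
M = frozenset({"animated figure"})
I = {(g, m) for g in G for m in M}

def powerset(s):
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def up(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

protoconcepts = [(A, B) for A in powerset(G) for B in powerset(M)
                 if up(A) == up(down(B))]

# S^imp: pairs (g, p) with g in the extent of p.
S_imp = {(g, p) for p in protoconcepts for g in p[0]}

def leq(p, q):
    return p[0] <= q[0] and p[1] >= q[1]

def meet(p, q):
    A = p[0] & q[0]
    return (A, up(A))

def implicational_closure(X):
    """Smallest order ideal of S^imp containing X and closed under meets."""
    closure = set()
    for g in {g for g, _ in X}:
        ps = [p for h, p in X if h == g]
        # Start from p meet p so that a single protoconcept is handled too.
        m = reduce(meet, ps, meet(ps[0], ps[0]))
        closure |= {(g, q) for q in protoconcepts if leq(m, q)}
    return closure

TOP = (G, frozenset())
gamma_tom = (frozenset({"Tom"}), up({"Tom"}))
gamma_jerry = (frozenset({"Jerry"}), up({"Jerry"}))
```

With these assumptions, closing the two units (Tom, γ(Tom)) and (Jerry, γ(Jerry)) yields all of `S_imp`, which matches the remark above that the 0-ary conceptual content of the graph in Figure 4 is $S^{\mathrm{imp}}(\mathbb{K}_0)$.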

Figure 5 shows the set of all closures of $\underline{S}^{\mathrm{imp}}(\mathbb{K}_0)$ ordered by set inclusion.

Lemma 4. The implicational closures of the implicational context structure
$\underline{S}^{\mathrm{imp}}(\mathbb{K})$ are exactly the conceptual contents of $\mathbb{K}$. 

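Proof: We prove that $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K})) = \mathcal{C}(\mathbb{K})$: First let $U \in \mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$. We will
prove that $U$ satisfies the condition (P). Hence, we assume that $A \times \mathfrak{P} \subseteq U$ and
that $\mathbb{K}$ satisfies $\mathfrak{P} \to \mathfrak{Q}$. First we show $A \times \mathfrak{Q} \subseteq S^{\mathrm{imp}}(\mathbb{K})$: We have $A \subseteq \mathrm{Ext}(p)$ for
all $p \in \mathfrak{P}$, which implies that $A \subseteq \bigcap_{p \in \mathfrak{P}} \mathrm{Ext}(p) = \mathrm{Ext}(\bigsqcap_{p \in \mathfrak{P}} p)$. Since $\mathfrak{P} \to \mathfrak{Q}$
implies $\bigsqcap_{p \in \mathfrak{P}} p \sqsubseteq \bigsqcap \mathfrak{Q} \sqsubseteq q$ for all $q \in \mathfrak{Q}$, we obtain $A \subseteq \mathrm{Ext}(q)$ for all
$q \in \mathfrak{Q}$. Thus, $A \times \mathfrak{Q} \subseteq S^{\mathrm{imp}}(\mathbb{K})$. Moreover, for all $g \in A$ and $q \in \mathfrak{Q}$ we have
$(g, q) \leq (g, \bigsqcap \mathfrak{Q}) \leq (g, \bigsqcap_{p \in \mathfrak{P}} p) \in \bigsqcap (A \times \mathfrak{P}) \subseteq U$, hence $A \times \mathfrak{Q} \subseteq U$. Obviously,
$(U, \emptyset, \emptyset, \kappa, \rho)$ with $\kappa((g, p)) = p$ and $\rho((g, p)) = \{g\}$ is a protoconcept graph of $\mathbb{K}$
with $U$ as conceptual content.

Now, let $U \in \mathcal{C}(\mathbb{K})$. First we show that $U$ is an order ideal in $\underline{S}^{\mathrm{imp}}(\mathbb{K})$: Let
$(g, p) \in U$ and $(h, q) \leq (g, p)$. Then $h = g$ and, since $p \sqsubseteq q$ implies $p \sqcap p \sqsubseteq q \sqcap q$,
we obtain that $\mathbb{K}$ satisfies $\{p\} \to \{q\}$. The fact that $U$ satisfies (P) then yields
$\{g\} \times \{q\} \subseteq U$, hence $(h, q) \in U$ and $U$ is an order ideal. Finally, let $A \times \mathfrak{Q} \subseteq U$.
Since $\mathbb{K}$ satisfies $\mathfrak{Q} \to \{\bigsqcap \mathfrak{Q}\}$, we obtain $\bigsqcap (A \times \mathfrak{Q}) = A \times \{\bigsqcap \mathfrak{Q}\} \subseteq U$, which finishes
the proof that the conceptual content $U$ is an implicational closure of $\underline{S}^{\mathrm{imp}}(\mathbb{K})$.

□

[Figure 5: the closures in $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}_0))$, ordered by set inclusion.]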

It can easily be checked that both sets $\mathcal{C}(\mathbb{K})$ and $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$ are closure
systems. Moreover, the set of all closures of a closure system, ordered by inclusion, is
always a complete lattice. We obtain that the extents of the context
$(S^{\mathrm{imp}}(\mathbb{K}), \mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K})), \in)$ are
exactly the closures, which would already yield the desired result. However, we
aim at a more thorough understanding of the structure $(\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K})), \subseteq)$. 

In particular, we would like to find the $\bigwedge$-irreducible elements of the
complete lattice $(\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K})), \subseteq)$, that is, those closures which cannot be represented
as intersections of other closures. For the lattice $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$, every element is
then the intersection of a set of these $\bigwedge$-irreducible closures, resulting both in
an explicitly described structure of $(\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K})), \subseteq)$ and in fewer attributes for
the context whose extents will be the implicational closures. The latter follows
from the fact that the concept lattice of the context with object set $S^{\mathrm{imp}}(\mathbb{K})$
and whose attributes are the $\bigwedge$-irreducible elements of $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$ will have
the closures of $\underline{S}^{\mathrm{imp}}(\mathbb{K})$ as extents as well (in particular we will find that the
$\bigwedge$-irreducible elements are $\bigwedge$-dense). 

We will now proceed as follows: We show that $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$ is isomorphic
to a product of lattices which can be described rather easily. Then we express
the $\bigwedge$-irreducible elements of the product lattice in terms of elements of the
factors. Using the isomorphism, we are then able to determine the $\bigwedge$-irreducible
closures in $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$. 

The next Proposition is similar to Proposition 1 in [PW99]. There, it was
shown that the lattice of equivalence classes of concept graphs of a given
power context family $\vec{\mathbb{K}}$ is isomorphic to the subdirect product of specific
sublattices of the concept lattices $\underline{\mathfrak{B}}(\mathbb{K}_k)$, each extended by a new $\top$-element. Here,
we show that the lattice of conceptual contents of a context $\mathbb{K}$ is isomorphic to
the direct product of sublattices of the Boolean algebra $\underline{\mathfrak{H}}_\sqcap(\mathbb{K})$, each extended
by a new $\bot$-element: 

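Proposition 5. Let $\mathbb{K} := (G, M, I)$ be a context. We define $\mathfrak{H}_g := \mathfrak{H}_\sqcap(\mathbb{K}) \cap [\gamma(g))$
and $V_g := (\{g\} \times \mathfrak{H}_g) \cup \{\bot_g\}$ and define an order on $V_g$ by $v \leq w$ if $v = \bot_g$ or
$v \leq_{S^{\mathrm{imp}}(\mathbb{K})} w$ ($v, w \in V_g$). Then $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K})) \cong \prod_{g \in G} V_g$. 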

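Proof: We prove that the mapping

\[
\varphi : \prod_{g \in G} V_g \to \mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K})), \qquad
(a_g)_{g \in G} \mapsto \bigcup_{a_g = (g, q_g) \in V_g \setminus \{\bot_g\}} \{g\} \times [q_g)
\]

is an order-isomorphism.

First we show that $\varphi$ is well defined, i.e. that $\varphi((a_g)_{g \in G}) \in \mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$ for
all $(a_g)_{g \in G}$. Let $X := \varphi((a_g)_{g \in G})$. We need to
check that $X$ is an order ideal in $\underline{S}^{\mathrm{imp}}(\mathbb{K})$ which is closed under $\bigsqcap$. First, let
$(g, p) \in X$ and $(h, q) \leq_{S^{\mathrm{imp}}(\mathbb{K})} (g, p)$. Then $h = g$ and $p \sqsubseteq q$. However, since
$(g, p) \in X$, we have $a_g \neq \bot_g$. Thus, there exists a $p_g \in \mathfrak{H}_g$ with $a_g = (g, p_g)$.
Moreover, $(g, p) \in X$ implies $p_g \sqsubseteq p$. Since $p \sqsubseteq q$, this in turn implies $p_g \sqsubseteq q$.
Hence $q \in [p_g)$, and therefore $(g, q) = (h, q) \in X$. Next, let $A \times \mathfrak{Q} \subseteq X$. Hence,
for each $g \in A$ we have $\{g\} \times \mathfrak{Q} \subseteq X$. Then $a_g = (g, p_g)$ for some $p_g \in \mathfrak{H}_g$,
yielding $p_g \sqsubseteq q$ for all $q \in \mathfrak{Q}$. Since $p_g$ is a $\sqcap$-semiconcept, it is the smallest
protoconcept with extent $\mathrm{Ext}(p_g)$ and we obtain $p_g \sqsubseteq \bigsqcap \mathfrak{Q}$. Thus, for all $g \in A$
we have $(g, \bigsqcap \mathfrak{Q}) \in X$, yielding in turn that $\bigsqcap (A \times \mathfrak{Q}) \subseteq X$. This finishes the proof
that $X \in \mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$.

Next we show that $\varphi$ is an order-embedding, i.e. that
$(a_g)_{g \in G} \leq (b_g)_{g \in G} \iff \varphi((a_g)_{g \in G}) \subseteq \varphi((b_g)_{g \in G})$,
where $\leq$ denotes the order of $\prod_{g \in G} V_g$. Thus, let $(a_g)_{g \in G} \leq (b_g)_{g \in G}$. This is equivalent
to $a_g \leq_{V_g} b_g$ for all $g \in G$. First we notice that $a_g = \bot_g$ is equivalent to
$\varphi((a_g)_{g \in G}) \cap (\{g\} \times [\gamma(g))) = \emptyset$. Moreover, for $a_g = (g, p_g)$ and $b_g = (g, q_g)$
we find

\[
\begin{aligned}
a_g \leq_{V_g} b_g
&\iff (g, p_g) \leq_{S^{\mathrm{imp}}(\mathbb{K})} (g, q_g)
\iff q_g \sqsubseteq p_g
\iff [p_g) \subseteq [q_g) \\
&\iff \{g\} \times [p_g) \subseteq \{g\} \times [q_g) \\
&\iff \varphi((a_g)_{g \in G}) \cap (\{g\} \times [\gamma(g))) \subseteq \varphi((b_g)_{g \in G}) \cap (\{g\} \times [\gamma(g))).
\end{aligned}
\]

Hence, $(a_g)_{g \in G} \leq (b_g)_{g \in G}$ is equivalent to the fact that for all $g \in G$ we
have $\varphi((a_g)_{g \in G}) \cap (\{g\} \times [\gamma(g))) \subseteq \varphi((b_g)_{g \in G}) \cap (\{g\} \times [\gamma(g)))$, which in turn is
equivalent to $\varphi((a_g)_{g \in G}) \subseteq \varphi((b_g)_{g \in G})$.

The fact that $\varphi$ is an order-embedding implies that $\varphi$ is injective. What is left
to show is that $\varphi$ is surjective. In order to prove this, for each $S \in \mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$
we need to find an $(a_g)_{g \in G}$ with $\varphi((a_g)_{g \in G}) = S$. Thus, let $S \in \mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$, let
$A := \{g \in G \mid \exists p \in \mathfrak{P}(\mathbb{K}) : (g, p) \in S\}$, and for each $g \in A$ we consider the set
$\mathfrak{P}_g := \{p \in \mathfrak{P}(\mathbb{K}) \mid (g, p) \in S\}$. Since $S$ is a closure, we obtain for each $g \in A$ that
$\bigsqcap (\{g\} \times \mathfrak{P}_g) \subseteq S$, thus $(g, \bigsqcap \mathfrak{P}_g) \in S$. Since $\bigsqcap \mathfrak{P}_g$ is a $\sqcap$-semiconcept by definition,
we have $\bigsqcap \mathfrak{P}_g \in \mathfrak{H}_g$ and hence $\mathfrak{P}_g = [\bigsqcap \mathfrak{P}_g)$. Therefore, $S = \bigcup_{g \in A} \{g\} \times [\bigsqcap \mathfrak{P}_g)$.
Finally, we set $(a_g)_{g \in G}$ with

\[
a_g := \begin{cases} (g, \bigsqcap \mathfrak{P}_g) & \text{if } g \in A \\ \bot_g & \text{if } g \notin A. \end{cases}
\]

It can now be easily checked that $\varphi((a_g)_{g \in G}) = S$, finishing the proof that $\varphi$ is
an order-isomorphism. □ 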
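
Proposition 5 can be checked by brute force on a very small example. Each factor $V_g$ consists of the $2^{|G|-1}$ $\sqcap$-semiconcepts whose extent contains $g$ plus the element $\bot_g$, so the proposition predicts $(2^{|G|-1} + 1)^{|G|}$ closures. The sketch below enumerates all subsets of $S^{\mathrm{imp}}$ for an assumed two-object toy context (not the context of Figure 3) and counts those that are order ideals closed under meets; the names are hypothetical.

```python
from itertools import combinations

# Hypothetical toy context: both objects share the single attribute.
G = frozenset({"Tom", "Jerry"})
M = frozenset({"animated figure"})
I = {(g, m) for g in G for m in M}

def powerset(s):
    items = sorted(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

def up(A):
    return frozenset(m for m in M if all((g, m) in I for g in A))

def down(B):
    return frozenset(g for g in G if all((g, m) in I for m in B))

protoconcepts = [(A, B) for A in powerset(G) for B in powerset(M)
                 if up(A) == up(down(B))]
S_imp = [(g, p) for p in protoconcepts for g in p[0]]

def leq(p, q):
    return p[0] <= q[0] and p[1] >= q[1]

def meet(p, q):
    A = p[0] & q[0]
    return (A, up(A))

def is_closure(Y):
    """Order ideal of S^imp that is closed under (pairwise) meets per object."""
    for g, p in Y:
        # Order ideal: every more general protoconcept must be present.
        if any(leq(p, q) and (g, q) not in Y for q in protoconcepts):
            return False
        # Meet-closed: includes the case q == p, i.e. p meet p.
        if any(h == g and (g, meet(p, q)) not in Y for h, q in Y):
            return False
    return True

closures = [set(c) for r in range(len(S_imp) + 1)
            for c in combinations(S_imp, r) if is_closure(set(c))]
predicted = (2 ** (len(G) - 1) + 1) ** len(G)
```

For this toy context both numbers are 9. The enumeration visits all $2^{|S^{\mathrm{imp}}|}$ subsets and is therefore only feasible for tiny contexts.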

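In Figure 6 we find a visualization of Proposition 5 for $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}_0))$.

[Figure 6: visualization of the product decomposition; the labels recoverable
for the factor of Jerry are (Jerry, 3), (Jerry, 5) and $\bot_{\mathrm{Jerry}}$.]
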

Since each of the lattices $V_g$ is complete, the product $\prod_{g \in G} V_g$ is a complete
lattice as well. The lattice $V_g$ corresponds to the Boolean lattice
$\underline{\mathfrak{H}}_g := (\mathfrak{H}_\sqcap(\mathbb{K}) \cap [\gamma(g)), \sqsupseteq)$, with an additional $\bot$-element. The $\bigwedge$-irreducible elements of $V_g$
are therefore fairly easy to characterize: They consist of the set of coatoms
$\{(\{g, h\}, \{g, h\}^{I}) \in \mathfrak{H}_g \mid h \neq g\}$ of $(\mathfrak{H}_\sqcap(\mathbb{K}) \cap [\gamma(g)), \sqsupseteq)$ plus the additional
element $\bot_g$. It is easy to check that these $\bigwedge$-irreducible elements are $\bigwedge$-dense
in $V_g$. 

Having characterized the $\bigwedge$-irreducible elements of the factor lattices, we
may now determine the corresponding elements of the product: 

Lemma 6. Let $|G| > 1$. The element $(a_g)_{g \in G} \in \prod_{g \in G} V_g$ is $\bigwedge$-irreducible in
$\prod_{g \in G} V_g$ if and only if there is exactly one $h \in G$ such that $a_h$ is $\bigwedge$-irreducible
in $V_h$ and if $a_g = \top_{V_g}$ for all $g \neq h$. Moreover, the $\bigwedge$-irreducible elements are
$\bigwedge$-dense in $\prod_{g \in G} V_g$. 

Proof: First we show that each $(a_g)_{g \in G} \in \prod_{g \in G} V_g$ satisfying the condition is
$\bigwedge$-irreducible. Thus, let there be exactly one $h \in G$ such that $a_h$ is $\bigwedge$-irreducible
in $V_h$ and $a_g = \top_{V_g}$ for all $g \neq h$. Now we assume that

\[ (a_g)_{g \in G} = \bigwedge_{(b_g)_{g \in G} > (a_g)_{g \in G}} (b_g)_{g \in G}. \]

Then $b_g = \top_{V_g} = a_g$ for all $g \neq h$ and $a_h = \bigwedge_{b_h > a_h} b_h$. This, however, contradicts
the assumption that $a_h$ is $\bigwedge$-irreducible in $V_h$. Hence, $(a_g)_{g \in G}$ is $\bigwedge$-irreducible
in $\prod_{g \in G} V_g$.

Next we prove that each $\bigwedge$-irreducible element of the product lattice has the
form described above: If for $(a_g)_{g \in G} \in \prod_{g \in G} V_g$ there are two distinct elements
$h_1, h_2 \in G$ with $a_{h_1} \neq \top_{V_{h_1}}$ and $a_{h_2} \neq \top_{V_{h_2}}$, then $(a_g)_{g \in G} = (b_g)_{g \in G} \wedge (c_g)_{g \in G}$
with

\[
b_g := \begin{cases} a_g & \text{if } g \neq h_1 \\ \top_{V_{h_1}} & \text{otherwise} \end{cases}
\qquad
c_g := \begin{cases} a_g & \text{if } g \neq h_2 \\ \top_{V_{h_2}} & \text{otherwise.} \end{cases}
\]

Hence, $(a_g)_{g \in G}$ is not $\bigwedge$-irreducible in $\prod_{g \in G} V_g$.

Finally, the fact that the $\bigwedge$-irreducible elements of $\prod_{g \in G} V_g$ are $\bigwedge$-dense
in the product immediately follows from their construction. □ 
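
The characterization in Lemma 6 can be illustrated on a small product. For $|G| = 2$ each factor $V_g$ is a three-element chain ($\bot_g$ below a two-element Boolean part), so the sketch below models the factors as chains 0 < 1 < 2; this modelling, and all names in the code, are assumptions made for illustration only.

```python
from itertools import product

# Assumption: two factors, each the chain 0 < 1 < 2 (the case |G| = 2).
factors = [range(3), range(3)]
elements = list(product(*factors))
top = tuple(max(f) for f in factors)

def leq(a, b):
    """Componentwise order of the product lattice."""
    return all(x <= y for x, y in zip(a, b))

def meet_all(elems):
    """Componentwise minimum, i.e. the meet in a product of chains."""
    return tuple(min(col) for col in zip(*elems))

def is_meet_irreducible(a):
    """a is not the top and is not the meet of the elements strictly above it."""
    if a == top:
        return False
    above = [b for b in elements if leq(a, b) and b != a]
    return meet_all(above) != a

irreducible = {a for a in elements if is_meet_irreducible(a)}

def satisfies_lemma6(a):
    """Exactly one coordinate is below the top of its factor.

    In a chain every non-top element is meet-irreducible, so no further
    check on that coordinate is needed here.
    """
    return sum(1 for i, x in enumerate(a) if x != top[i]) == 1

characterized = {a for a in elements if satisfies_lemma6(a)}
```

Both sets agree and contain four elements, which is $|G|^2$ for $|G| = 2$: each object contributes $|G| - 1$ coatoms and one $\bot_g$.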

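As an immediate result of Proposition 5 and Lemma 6 we obtain

Corollary 7. For $k \in G$, let $I(k) := \{p \in \mathfrak{P}(\mathbb{K}) \mid p \sqsubseteq (G \setminus \{k\}, (G \setminus \{k\})^{I})\}$.
Then the $\bigwedge$-irreducible elements of $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$ are exactly the elements of
$M^{\mathrm{irr}}(\mathbb{K}) := M^{\mathrm{irr}}_{1}(\mathbb{K}) \cup M^{\mathrm{irr}}_{2}(\mathbb{K})$ with

$M^{\mathrm{irr}}_{1}(\mathbb{K}) := \{S^{\mathrm{imp}}(\mathbb{K}) \setminus (\{g\} \times \mathfrak{P}(\mathbb{K})) \mid g \in G\}$ and
$M^{\mathrm{irr}}_{2}(\mathbb{K}) := \{S^{\mathrm{imp}}(\mathbb{K}) \setminus (\{g\} \times I(k)) \mid g, k \in G, g \neq k\}$.

Proof: We apply the isomorphism from Proposition 5 to the $\bigwedge$-irreducible
elements of Lemma 6. □ 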

We may thus define, in accordance with [Wi03], the so-called conceptual
information context of $\mathbb{K}$ as $\mathbb{K}^{\mathrm{inf}}(\mathbb{K}) := (S^{\mathrm{imp}}(\mathbb{K}), M^{\mathrm{irr}}(\mathbb{K}), \in)$, enabling us to state the first
main result of this paper: 

Theorem 8. (Basic Theorem on $\mathbb{K}$-Conceptual Contents) For a formal
context $\mathbb{K}$, the extents of the conceptual information context $\mathbb{K}^{\mathrm{inf}}(\mathbb{K})$ are exactly
the implicational closures of the implicational context structure $\underline{S}^{\mathrm{imp}}(\mathbb{K})$; those
implicational closures are exactly the conceptual contents of $\mathbb{K}$. 

Proof: Since the extents of $\mathbb{K}^{\mathrm{inf}}(\mathbb{K})$ are exactly the intersections of attribute
extents and since the elements of $\mathcal{C}(\underline{S}^{\mathrm{imp}}(\mathbb{K}))$ are exactly the intersections of
elements of $M^{\mathrm{irr}}(\mathbb{K})$, Corollary 7 yields the first result. With Lemma 4 we then obtain the
second claim. □ 

We may now extend the Basic Theorem on $\mathbb{K}$-Conceptual Contents to the
general case of (limited) power context families. The construction and the proof
of the Theorem are essentially the same as in [Wi03]. However, in order to make
this paper self-contained we repeat both the theorem and the proof, with a few
adjustments to suit our purpose. Let $\vec{\mathbb{K}} := (\mathbb{K}_0, \mathbb{K}_1, \ldots, \mathbb{K}_n)$ be a power context
family with $\mathbb{K}_k := (G_k, M_k, I_k)$ ($k = 0, 1, \ldots, n$). The conceptual information
context corresponding to $\vec{\mathbb{K}}$ is defined as the formal context 

\[ \mathbb{K}^{\mathrm{inf}}(\vec{\mathbb{K}}) := \mathbb{K}^{\mathrm{inf}}(\mathbb{K}_0) + \mathbb{K}^{\mathrm{inf}}(\mathbb{K}_1) + \cdots + \mathbb{K}^{\mathrm{inf}}(\mathbb{K}_n), \]

thus as the direct sum of the contexts $\mathbb{K}^{\mathrm{inf}}(\mathbb{K}_k)$ ($k = 0, 1, \ldots, n$). An extent $U$ of
$\mathbb{K}^{\mathrm{inf}}(\vec{\mathbb{K}})$ is said to be rooted if $((g_1, \ldots, g_k), p_k) \in U$ implies $(g_j, \top_0) \in U$ for
$j = 1, \ldots, k$ and $\top_0 := (G_0, G_0^{I_0})$. Rooted extents are needed in order to identify
the graphs with extents of the context $\mathbb{K}^{\mathrm{inf}}(\vec{\mathbb{K}})$ (an extent which is not rooted
would correspond to a graph which has an edge but is missing at least one of the
adjacent vertices). Now we are able to formulate the desired theorem: 
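
The rootedness condition is easy to test on a given set of pairs. In the sketch below, objects of $\mathbb{K}_0$ are assumed to be represented as strings and objects of $\mathbb{K}_k$ ($k \geq 1$) as tuples; this representation and the function name are hypothetical, and the sketch checks only the rootedness condition, not whether the set is an extent.

```python
def is_rooted(U, top0):
    """Check that every tuple object in U has all its components rooted.

    U is a set of pairs (obj, protoconcept). For every pair whose object is
    a tuple (g_1, ..., g_k), each (g_j, top0) must also belong to U.
    """
    return all((g, top0) in U
               for obj, _ in U if isinstance(obj, tuple)
               for g in obj)

# Assumed toy data: top0 = (G_0, G_0^{I_0}) for the two-object context.
G0 = frozenset({"Tom", "Jerry"})
top0 = (G0, frozenset({"animated figure"}))
mu_chases = (frozenset({("Tom", "Jerry")}), frozenset({"chases"}))

rooted = {(("Tom", "Jerry"), mu_chases), ("Tom", top0), ("Jerry", top0)}
not_rooted = {(("Tom", "Jerry"), mu_chases), ("Tom", top0)}
```

The second set corresponds to the situation described above: an edge between Tom and Jerry is present, but the vertex for Jerry is missing.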

Theorem 9. (Basic Theorem on $\vec{\mathbb{K}}$-Conceptual Contents) For a power
context family $\vec{\mathbb{K}}$ of limited type $n$ the conceptual contents of the protoconcept
graphs of $\vec{\mathbb{K}}$ are exactly the rooted extents of the corresponding conceptual
information context $\mathbb{K}^{\mathrm{inf}}(\vec{\mathbb{K}})$. 

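Proof: By definition, the conceptual content of a protoconcept graph is the
disjoint union $C(\mathfrak{G}) := C_0(\mathfrak{G}) \mathbin{\dot\cup} C_1(\mathfrak{G}) \mathbin{\dot\cup} \cdots \mathbin{\dot\cup} C_n(\mathfrak{G})$. By Theorem 8, for each
$k = 0, 1, \ldots, n$ the conceptual content $C_k(\mathfrak{G})$ is an extent of $\mathbb{K}^{\mathrm{inf}}(\mathbb{K}_k)$. Since
$\mathbb{K}^{\mathrm{inf}}(\vec{\mathbb{K}})$ is the direct sum of all these contexts, we find that $C(\mathfrak{G})$ is an extent
of $\mathbb{K}^{\mathrm{inf}}(\vec{\mathbb{K}})$. This extent is rooted as a direct consequence of the definition of
protoconcept graphs. Conversely, let $U$ be a rooted extent of $\mathbb{K}^{\mathrm{inf}}(\vec{\mathbb{K}})$. Then
$U_k := U \cap S^{\mathrm{imp}}(\mathbb{K}_k)$ is an extent of $\mathbb{K}^{\mathrm{inf}}(\mathbb{K}_k)$ for each $k = 0, 1, \ldots, n$ and therefore an
implicational closure of $\underline{S}^{\mathrm{imp}}(\mathbb{K}_k)$ by Theorem 8. Now we define a protoconcept
graph $\mathfrak{G} := (V, E, \nu, \kappa, \rho)$ by $V := U_0$, $E := \bigcup_{k = 1, \ldots, n} U_k$,
$\nu(((g_1, \ldots, g_k), p_k)) := ((g_1, \top_0), \ldots, (g_k, \top_0))$, $\kappa((g, p_0)) := p_0$,
$\kappa(((g_1, \ldots, g_k), p_k)) := p_k$, $\rho((g, p_0)) := \{g\}$.
Obviously, this protoconcept graph has $U$ as conceptual content. □ 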

This result can be considered as a further step in elaborating the
inter-relationships of Contextual Logic. Generalizing the result in [Wi03], it makes
protoconcept graphs accessible via Formal Concept Analysis by proving that
even conceptual contents of judgments with a (local) negation are derivable
as formal concepts. Moreover, Proposition 5 and the two basic theorems give
us some structural insight regarding the closure systems $\mathcal{C}(\mathbb{K})$ and $\mathcal{C}(\vec{\mathbb{K}})$ which
might prove valuable for the development of TOSCANA-systems for protoconcept
graphs of power context families (see [EGSW00]). 

4 Outlook 

In this paper, object implications in power context families were not considered
at all (for the reasons explained at the end of Section 2). However,
it seems possible to include a variant of object implications in this theory by
defining a closure closed under object implications using only instances which
are in $S^{\mathrm{imp}}(\mathbb{K})$. Since this approach is beyond the scope of this paper, it would be
desirable to continue the study of conceptual contents of protoconcept graphs. 

In [Da04], it was described how a logic approach to concept graphs can ef- 
fectively deal with object and concept implications via new derivation rules. It 
might be interesting to see if and how this theory can be transferred to a theory 
of protoconcept graphs which is based on a separation in syntax and semantics. 
In particular, the relation of such a syntactic approach and the semantic theory 
described in the present paper might lead to new insights with respect to the 
overall theory of protoconcept graphs. 
1 Introduction

In this paper we propose a semiotic conceptual framework which is compatible with 
Peirce’s definition of signs and uses formal concept analysis for its conceptual structures. 
The goal of our research is to improve the use of formal languages such as ontology 
languages and programming languages. Even though there exist a myriad of theories, 
models and implementations of formal languages, in practice it is often not clear which 
strategies to use. AI ontology language research is in danger of repeating mistakes that 
have already been studied in other disciplines (such as linguistics and library science) 
years ago. 

Just to give an example of existing inefficiencies: Prechelt (2000) compares the imple- 
mentations of the same program in different programming languages. In an experiment 
he asked programmers of different languages to write a program for a certain problem. 
All programmers of so-called scripting languages (Perl, Python, Tcl) used associative
arrays as the main data structure for their solution, which resulted in very efficient code. 
C++ and Java programmers did not use associative arrays but instead manually designed 
suitable data structures, which in many cases were not very efficient. In scripting lan- 
guages associative arrays are very commonly used and every student of such languages 
is usually taught how to use them. In Java and C++, associative arrays are available via 
the class hierarchy, but not many programmers know about them. Therefore scripting 
languages performed better in the experiment simply because programmers of Java and 
C++ were not able to find available, efficient data structures within the large class li- 
braries of these languages. Of course, this does not imply that scripting languages always 
perform better, but in some cases apparently large class libraries can be a hindrance. 

These kinds of problems indicate that the challenges of computing nowadays lie 
frequently in the area of information management. A semiotic-conceptual framework as 
proposed in this paper views formal languages within a system of information manage- 
ment tasks. More specifically, it identifies management tasks relating to names (names- 
paces), contexts and (object) identifiers as the three contributing factors. These three 
management tasks correspond to the three components of a sign: representation, context 
and denotation. 

It has been shown in the area of software engineering that formal concept analysis can 
be used for such information management tasks (Snelting, 1995). Snelting uses formal 
concept analysis for managing the dependencies of variables within legacy code. But 
we argue that it is not obvious how to connect the three different areas of management 
tasks to each other if considering only conceptual structures, because sign use involves 
semiotic aspects in addition to conceptual structures. The semiotic conceptual framework 

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 28-38, 2004. 
described in this paper facilitates a formal description of semiotic aspects of formal
languages. It predicts the roles which conceptual and semiotic aspects play in formal 
languages. This is illustrated in a few examples in section 8 of this paper. It should be 
pointed out, however, that this research is still in its beginnings. We have not yet explored 
the full potential of applications of this semiotic conceptual framework. 

2 The Difference between Signs and Mathematical Entities 

A semiotic conceptual framework contrasts signs with mathematical entities. The vari- 
ables in formal logic and mathematics are mathematical entities because they are fully 
described by rules, axioms and grammars. Programmers might think of mathematical 
entities as “strings”, which have no other meaning apart from their functioning as place 
holders. Variables in declarative programming languages are richer entities than strings 
because they have a name, a data type and a value (or state) which depends on the time 
and context of the program when it is executed. These variables are modelled as signs 
in a semiotic conceptual framework. 

One difference between mathematical entities and signs is the relevance of context. 
Mathematics employs global contexts. From a mathematical view, formal contexts in 
formal concept analysis are just mathematical entities. The socio-pragmatic context of 
an application of formal concept analysis involves signs but extends far beyond for- 
mal structures. On the other hand, computer programs are completely formal but their 
contexts always have a real-time spatial-temporal component, including the version of 
the underlying operating system and the programmer’s intentions. Computer programs 
cannot exist without user judgements, whereas mathematical entities are fully defined 
independently of a specific user. 

Many areas of computing require an explicit handling of contextual aspects of signs. 
For example, contextual aspects of databases include transaction logs, recovery routines, 
performance tuning, and user support. Programmers often classify these as “error” or 
“exception” handling procedures because they appear to distract from the elegant, log- 
ical and deterministic aspects of computer programs. But if one considers elements of 
computers as “signs”, which exist in real world contexts, then maybe contextual aspects 
can be considered the norm whereas the existence of logical, deterministic, algorithmic 
aspects is a minor (although very convenient) factor. 

3 A Semiotic Conceptual Definition of Signs 

Peirce (1897) defines a sign as follows: “A sign, or representamen, is something which 
stands to somebody for something in some respect or capacity. It addresses somebody, 
that is, creates in the mind of that person an equivalent sign, or perhaps a more developed 
sign. That sign which it creates I call the interpretant of the first sign. The sign stands for 
something, its object.” Our semiotic conceptual framework is based on a formalisation of
this definition, which is described below. To avoid confusion with the modern meaning 
of “object” in programming languages, “denotation” is used instead of “object”. 

A representamen is a physical form for communication purposes. Representamens 
of formal languages, for example variable names, are considered mathematical entities 
in this paper. All allowable operations among representamens are fully described by
the representamen rules of a formal language. Two representamens are equal if their 
equality can be mathematically concluded from the representamen rules. For example, 
representamen rules could state that a string “4+1” is different from a string “5”, whereas 
for numbers: 4 + 1 = 5. 

In a semiotic conceptual framework, Peirce’s sign definition is formalised as fol- 
lows: A sign is a triadic relation (rmn(s) , den(s) , ipt(s)) consisting of a representamen 
rmn(s), a denotation den(s) and an interpretant ipt(s) where den(s) and ipt(s) are 
signs themselves (cf. figure 1). It is sometimes difficult to distinguish between a sign 
and its representamen, but rmn(s) is used for the mathematical entity that refers to the 
sign and s for the sign itself. 
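
As a minimal sketch, and not part of the formalisation itself, the triadic relation can be written down as a small Python data structure. The class name `Sign`, its field names and the example signs are illustrative assumptions; `None` stands in for components that are not modelled any further.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sign:
    """Hypothetical model of a sign as a triad (rmn, den, ipt)."""
    rmn: str                      # representamen: a mathematical entity, here a string
    den: Optional["Sign"] = None  # denotation: itself a sign (None = not modelled)
    ipt: Optional["Sign"] = None  # interpretant: itself a sign (None = not modelled)


# An interpretant (for example one run of a program) and a value, both signs.
run = Sign(rmn="program run 1")
five = Sign(rmn="5", ipt=run)

# The variable "age" denotes the value 5 with respect to that interpretant.
age = Sign(rmn="age", den=five, ipt=run)

print(age.rmn, "->", age.den.rmn)
```

The sketch keeps the distinction made above: `age.rmn` is the mathematical entity that refers to the sign, whereas `age` is the sign itself.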




[Diagram not reproduced; surviving labels: formal concept, context]

Fig. 1. The sign triad



Even though the three components of the sign are written as mappings, rmn(), 
den(), ipt(), these mappings only need to be defined with respect to the smallest pos- 
sible interpretant which is the sign’s own interpretant at the moment when the sign is 
actually used. Further conditions for compatibility among interpretants must be pro- 
vided (see below) before triadic sign relations can be considered across several or larger 
interpretants. Such compatibility conditions must exist because otherwise signs would 
be completely isolated from each other and communication would not be possible. 

Interpretants and denotations are signs themselves with respect to other interpretants. 
But they also relate to mathematical entities. Denotations relate to (formal) concepts. 
Interpretants relate to contexts, which are in this paper defined as the formalisable aspects 
of interpretants. Interpretants contain more information than contexts. According to 
Peirce, interpretants mediate between representamens and denotations. Therefore an
interpretant or context only needs to contain as much information as is needed for 
understanding a sign. Because “a sign, or representamen, is something which stands 
to somebody for something in some respect or capacity” (Peirce, 1897), it follows that 
signs are not hypothetical but actually exist in the present or have existed in the past. 



4 Synonymy and Similar Sign Equivalences 

The definition of signs in a semiotic conceptual framework does not guarantee that 
representamens are unambiguous and represent exactly one denotation with respect to 
one interpretant. Furthermore, the definition does not specify under which conditions 
a sign is equal to another sign or even to itself. Conditions for interpretants must be 
described which facilitate disambiguation and equality. 

A set I of n interpretants, $i_1, \ldots, i_n$, is called overlapping iff

$\forall a, 1 \le a \le n \;\; \exists b, 1 \le b \le n, b \ne a \;\; \exists s : i_a \rightarrow s,\ i_b \rightarrow s$ (1)

where s denotes a sign. The arrow relation “→” is called “representation” and is the
same as in figure 1. This relation is central to Peirce’s definition of signs but shall not be 
further discussed in this paper. 

With respect to a set I of overlapping interpretants, any equivalence relation can be
called synonymy, denoted by $\equiv_I$, if the following necessary condition is fulfilled for all
signs represented by interpretants in I:

$(rmn(s_1), den(s_1), ipt(s_1)) \equiv_I (rmn(s_2), den(s_2), ipt(s_2)) \Longrightarrow s_1 \rightarrow den(s_2),\ s_2 \rightarrow den(s_1),\ den(s_1) \equiv_I den(s_2)$ (2)

Only a necessary but not sufficient condition is provided for synonymy because it 
depends on user judgements. In a programming language, synonymy can be a form of 
assignment. If a programmer assigns a variable to be a pointer (or reference) to another 
variable’s value, then these two variables are synonyms. Denotational equality is usually 
not required for synonymous variables because values can change over time. For example 
two variables, “counter := 5” and “age := 5”, are not synonymous just because they have 
the same value. Because synonymy is an equivalence relation, signs are synonymous to 
themselves. 

A further condition for interpretants ensures disambiguation of representamens: A 
set I of overlapping interpretants is called compatible iff 

$\forall i_1, i_2 \in I \;\; \forall s_1, s_2 : (i_1 \rightarrow s_1,\ i_2 \rightarrow s_2,\ rmn(s_1) = rmn(s_2) \Longrightarrow s_1 \equiv_I s_2)$ (3)

In the rest of this paper, all single interpretants are always assumed to be compatible with 
themselves. Compatibility between interpretants can always be achieved by renaming 
of signs, for example, by adding a prefix or suffix to a sign. 
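
The compatibility condition and the renaming remedy can be sketched in a few lines of Python. Everything in this example is an assumption made for illustration: signs are reduced to triples of interpretant, representamen and denotation, and the function `same_denotation` merely stands in for a user's synonymy judgement, which the framework does not tie to denotational equality.

```python
from itertools import combinations

# Each sign is (interpretant, representamen, denotation); all names are made up.
signs = [
    ("module_a", "counter", "number of guesses"),
    ("module_b", "counter", "shop counter"),
    ("module_b", "age", "age in years"),
]


def is_compatible(signs, synonymous):
    """Signs with equal representamens must be synonymous."""
    return all(
        synonymous(s1, s2)
        for s1, s2 in combinations(signs, 2)
        if s1[1] == s2[1]
    )


def same_denotation(s1, s2):
    # Stand-in for a user's synonymy judgement (an assumption of this sketch).
    return s1[2] == s2[2]


def add_prefix(signs):
    """Rename every sign by prefixing its representamen with its interpretant."""
    return [(ipt, f"{ipt}.{rmn}", den) for ipt, rmn, den in signs]


print(is_compatible(signs, same_denotation))              # False
print(is_compatible(add_prefix(signs), same_denotation))  # True
```

The two `counter` signs clash before renaming; after prefixing, no two representamens are equal, so the condition holds trivially.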

The following other equivalences are defined for signs in a set I of compatible 
interpretants: 
identity: $s_1 \asymp_I s_2 \;:\Longleftrightarrow\; id(s_1) = id(s_2),\ s_1 \equiv_I s_2$ (4)

polysemy: $s_1 \doteq_I s_2 \;:\Longleftrightarrow\; rmn(s_1) = rmn(s_2)$ (5)

equality: $s_1 =_I s_2 \;:\Longleftrightarrow\; den(s_1) =_I den(s_2),\ rmn(s_1) = rmn(s_2)$ (6)

equinymy: $s_1 \cong_I s_2 \;:\Longleftrightarrow\; den(s_1) =_I den(s_2),\ s_1 \equiv_I s_2$ (7)

Sl =/ s 2 *£= 


= Si =/ s 2 









Identity refers to what is called “object identifiers” in object-oriented languages 
whereas equinymy is a form of value equality. For example, in a program sequence, 
“age := 5, counter := 5, age := 6”, the variables “age” and “counter” are initially equal. 
But “age” is identical to itself even though it changes its value. Identity is implemented 
via a set X of mathematical entities. The elements of X are called identifiers. A mapping 
id() maps a sign onto an identifier or onto NULL if the sign does not have an identifier. It 
should be required that if two signs are equal and one of them has an identifier then both 
signs are also identical. The only operation or relation that is available for identifiers is 
“=”. In contrast to synonymy which is asserted by users, object-oriented programming 
languages and databases have rules for when and how to create “object identifiers”. 
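
The difference between identity and value equality in the sequence “age := 5, counter := 5, age := 6” can be illustrated with a short Python sketch. The class `Variable` is a hypothetical stand-in for a variable sign, and Python's built-in `id()` is used here only as one possible realisation of the mapping id(); it supports nothing beyond comparison for equality, as required above.

```python
class Variable:
    """Hypothetical variable sign with a name (representamen) and a value."""

    def __init__(self, name, value):
        self.name = name
        self.value = value


age = Variable("age", 5)          # age := 5
counter = Variable("counter", 5)  # counter := 5

age_id_before = id(age)
equal_values_before = age.value == counter.value

age.value = 6                     # age := 6

print(equal_values_before)         # True: the values coincide at first
print(age.value == counter.value)  # False: the values now differ
print(id(age) == age_id_before)    # True: "age" is still identical to itself
print(id(age) == id(counter))      # False: two distinct identifiers
```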

Because the relations in (4)-(7) are equivalence relations, signs are identical, poly- 
semous, equinymous and equal to themselves. In (5) polysemy is defined with respect 
to equal representamens but only in compatible interpretants. Signs with equal rep- 
resentamens across non-compatible interpretants are often called “homographs”. This 
definition of “polysemy” is different from the one in linguistics which does not usually 
imply synonymy. 

The following statements summarise the implications among the relations in (4)-(7). 

$s_1 \asymp_I s_2$ or $s_1 \doteq_I s_2$ or $s_1 \cong_I s_2 \Longrightarrow s_1 \equiv_I s_2$ (8)

$s_1 =_I s_2 \Longleftrightarrow s_1 \doteq_I s_2,\ s_1 \cong_I s_2$ (9)



5 Anonymous Signs and Mergeable Interpretants 

Two special cases are of interest: anonymous signs and mergeable interpretants. In pro- 
gramming languages, anonymous signs are constants. An anonymous sign with respect 
to compatible interpretants I is defined as a sign with 

$s =_I den(s)$ (10)

An anonymous sign denotes itself and has no other representamen than the repre- 
sentamen of its denotation. Signs which are anonymous with respect to one interpretant 
need not be anonymous with respect to other interpretants. 

The following equations and statements are true for anonymous signs $s$, $s_1$, $s_2$:

$s =_I den(s) \Longrightarrow s =_I den(s) =_I den(den(s)) =_I \ldots$ (11)

$s =_I den(s) \Longrightarrow rmn(s) = rmn(den(s)) = rmn(den(den(s))) = \ldots$ (12)

$s_1 =_I s_2 \Longleftrightarrow den(s_1) =_I den(s_2)$ (13)

$s_1 =_I s_2 \Longleftrightarrow s_1 \cong_I s_2$ (14)

Statement (14) is true because of $s_1 \cong_I s_2 \Longrightarrow den(s_1) =_I den(s_2) \Longrightarrow s_1 =_I s_2$.
Thus for anonymous signs equality and equinymy coincide. It is of interest to consider 
interpretants in which equinymy and synonymy coincide. That means that synonyms 
have equal instead of just synonymous denotations. This leads to the next definition: 

A set I of compatible interpretants is called mergeable iff for all signs in I

$s_1 \equiv_I s_2 \Longrightarrow s_1 \cong_I s_2$ (15)

which means that all of its synonyms are equinyms. If an interpretant is not mergeable
with itself it can usually be split into several different interpretants which are mergeable
with themselves. In mergeable interpretants, it follows that

$s_1 \doteq_I s_2 \Longleftrightarrow s_1 =_I s_2 \Longrightarrow s_1 \equiv_I s_2 \Longleftrightarrow s_1 \cong_I s_2$ (16)

$s_1 =_I s_2 \Longleftrightarrow rmn(s_1) = rmn(s_2)$ (17)

because $rmn(s_1) = rmn(s_2) \Longrightarrow den(s_1) \equiv_I den(s_2) \Longrightarrow den(s_1) =_I den(s_2)$.

From (14) and (16) it follows that for anonymous signs in mergeable interpretants, 
the four equivalences, synonymy, polysemy, equality and equinymy are all the same. For 
anonymous signs in a mergeable interpretant, the representamen rules alone determine 
synonymy. Because representamens are mathematical entities, it follows that anonymous 
signs in mergeable interpretants behave like mathematical entities. 

6 Conceptual Structures 

While semiotic structures, such as synonymy, model the decisions a user makes with 
respect to a formal language, mathematical entities can be used to compute the con- 
sequences of such decisions. It is argued in this paper that the mathematical entities 
involved in signs (especially formal concepts and contexts) can be modelled as con- 
ceptual structures using formal concept analysis. Concept lattices can be used to show 
the consequences of the semiotic design decisions. Users can browse through concept 
lattices using a variety of existing software tools to explore the signs. 

This insight is not new. In fact, there are several papers (for example, Snelting (1996))
which demonstrate the usefulness of formal concept analysis in software engineering. 
Our semiotic conceptual framework adds a layer of explanation to these studies by 
detailing how semiotic and conceptual aspects both contribute to formal languages. A 
semiotic perspective also adds modes of communication or “speech acts” to the concep- 
tual framework. Examples are “assertion”, “query” and “question”. But these are not 
further discussed in this paper. 

Formal concept analysis, description logics, Sowa’s (1984) conceptual graphs, Barwise
& Seligman’s (1997) classifications and object-oriented formalisms each provide a
slightly different definition of concepts. But they all model denotations as binary relations
of types and instances/values. Thus denotations are signs of the form [typ(s) : ins(s)]
where typ(s) is a sign called type and ins(s) is a sign called instance (or value). A sign
s with $den(s) =_I [typ(s) : ins(s)]$ is written as s[typ(s) : ins(s)]. A formal concept is
an anonymous sign $c =_I (\{e_1, e_2, \ldots\}; \{i_1, i_2, \ldots\})$ where $e_1, e_2, \ldots$ are mathematical
entities that form the extension and $i_1, i_2, \ldots$ are mathematical entities that form the
intension of the formal sign.
For a fixed interpretant, denotations are mapped onto formal concepts as follows:
the intension is the set of types that are inherited by the sign via a type hierarchy and 
the extension is the set of instances that share exactly those types. As a restriction it 
should be required that each sign is synonymous to at most one formal concept within 
the given interpretant. Within the framework of formal concepts, equality of denotations 
can be mathematically evaluated because formal concepts are anonymous signs. Formal 
concepts as defined in this paper form concept lattices as defined in formal concept 
analysis. 
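
The mapping from types and instances to formal concepts can be made concrete with a small brute-force sketch in Python. The context below (three instances and four types) is invented for illustration, and the function names `extent`, `intent` and `formal_concepts` are assumptions of this sketch. Closing every subset of types is only feasible for very small contexts; dedicated formal concept analysis tools use more efficient algorithms.

```python
from itertools import combinations

# Hypothetical context: instances (objects) and the types they carry.
context = {
    "3": {"number", "integer"},
    "2.5": {"number", "float"},
    "'abc'": {"string"},
}
all_types = set().union(*context.values())


def extent(types):
    """Instances that have all the given types."""
    return {g for g, ts in context.items() if types <= ts}


def intent(instances):
    """Types shared by all the given instances."""
    shared = set(all_types)
    for g in instances:
        shared &= context[g]
    return shared


def formal_concepts():
    """Enumerate (extension, intension) pairs by closing every type subset."""
    concepts = set()
    for r in range(len(all_types) + 1):
        for types in combinations(sorted(all_types), r):
            ext = extent(set(types))
            concepts.add((frozenset(ext), frozenset(intent(ext))))
    return concepts


for ext, itt in sorted(formal_concepts(), key=lambda c: -len(c[0])):
    print(sorted(ext), sorted(itt))
```

Each printed pair is one formal concept: the extension contains exactly the instances that share the types in the intension, which is the condition stated above.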

7 Contexts and Meta-constructs 

Several formalisms for contexts have been suggested by AI researchers (e.g., McCarthy
(1993)) but in general they are not integrated into reasoning applications as frequently 
and not as well understood as representamens and formal concepts. Figure 1 indicates 
that contexts should play an important role in the formalisation of signs besides representamens
and formal concepts. We argue in this paper that contexts are not as difficult
to deal with as AI research suggests if they are modelled as formal concepts as well. 

If contexts are modelled as formal concepts, relationships between contexts, such as 
containment, temporal and spatial adjacency or overlap can be modelled as conceptual 
relations. Contexts as formal concepts are denotations of other signs with respect to other 
interpretants. Peirce stresses the existence of infinite chains of interpretants (interpretants 
of interpretants of interpretants ...). But as formal concepts, contexts are not modelled 
as contexts of contexts of contexts. All contexts can be modelled as formal concepts 
with respect to one special meta-context of contexts because containment chains are just 
conceptual relations, not meta-relations. 

An advantage of this approach is that apart from one meta-language which de- 
scribes the semiotic conceptual framework, no other meta-languages are required. All 
other seemingly “meta”-languages are modelled via conceptual containment relations 
between their corresponding contexts. For example, all facts, rules and constructors of a 
programming language can be described in a single context. A program of that language 
is executed in a different context. Both contexts are related via a containment relation 
with respect to the meta-context of contexts. But the context of a programming language 
is not a meta-context of a program. 

8 Examples 

The condition of mergeability of interpretants states that synonymous signs must have 
equal denotations. With respect to programming languages this means that as soon as 
a variable changes its value, a new interpretant must be formed. It may be possible to 
bundle sequential changes of values. For example, if all values in an array are updated 
sequentially, it may be sufficient to assume one interpretant before the changes and one 
after the changes instead of forming a separate interpretant after each change. Some 
variables may also be ignored, such as counters in for-statements. This corresponds to 
the distinction between persistent and transient objects in object-oriented modelling. 
Transient variables do not initiate new interpretants. 
    counter = 1
    print "game starts"
    while counter <= 5:
        number = input("please guess the number")
        if number == 5:
            print "good guess"
            break
        else:
            print "try again"
            counter = counter + 1
    else:
        print "game over"

Fig. 2. A piece of Python code 



The significance of the following examples is not that this is just another application 
of formal concept analysis but instead that this kind of analysis is suggested by the 
semiotic conceptual framework. The theory about mergeability of interpretants suggests 
that a number of different interpretants are invoked by any computer program depending 
on when certain variables change their values. It just happens that formal concept analysis 
can be used to analyse this data. We believe that careful consideration of the predictions 
made by the semiotic conceptual framework can potentially provide interesting insights. 
But we have not yet explored this further. 

The example in figure 2 shows a piece of Python code and a corresponding concept 
lattice in figure 3. The contexts (or formalisable parts of interpretants) are initiated by 
the changes of the variables “counter” and “number”. Each context produces a different 
behaviour, i.e., a different print statement by the program. The lattice in figure 3 is 
modelled as follows: the objects are the observable behaviours of the program (i.e., the 
print statements). The attributes are the states of the variables which are relevant for the 
contexts. The formal concepts are contexts of the program. If the counter is smaller than 
5 and the user guesses the number 5, the program prints “good guess”. If the number is 
not 5 but the counter is smaller than 5, the program prints “try again” and “please guess”, 
except in the first instance (counter = 1) when it prints “please guess” after having printed
“game starts”. If the counter is larger than 5 the game prints “game over”. 
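
The relation between variable states and observable behaviour can be explored with a small trace of the program. The following is only a sketch under stated assumptions: it is a Python 3 rendering of the code in figure 2, the interactive input is replaced by a scripted list of guesses, and the function name `trace` and the grouping by (counter, last guess) are choices of this sketch rather than the construction behind figure 3.

```python
from collections import defaultdict


def trace(guesses, secret=5, max_tries=5):
    """Run the game on scripted guesses; return (message, counter, number) records."""
    log = []
    counter, number = 1, None
    log.append(("game starts", counter, number))
    pending = iter(guesses)
    while counter <= max_tries:
        log.append(("please guess", counter, number))
        number = next(pending)  # stands in for the interactive input
        if number == secret:
            log.append(("good guess", counter, number))
            break
        log.append(("try again", counter, number))
        counter += 1
    else:
        log.append(("game over", counter, number))
    return log


# Group the printed messages by the variable state in which they occur.
behaviour = defaultdict(list)
for message, counter, number in trace([3, 5]):
    behaviour[(counter, number)].append(message)

for state, messages in behaviour.items():
    print(state, messages)
```

In this trace the message “please guess” occurs both in the initial state and in the state after a wrong guess, which is the kind of shared behaviour that the lattice is meant to make visible.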

In contrast to flowcharts, the lattice representation does not represent the sequence of 
the statements. Wolff & Yameogo’s (2003) temporal concept analysis could be applied 
Fig. 3. A lattice of contexts for the code in figure 2



to the lattice to insert the temporal sequence. The lattice shows the relationships among 
the contexts. For example, it shows that the start-context (counter = 1) and the contexts 
in which the wrong number was guessed share behaviour. This is not necessarily clearly 
expressed in the code itself. In fact our experience with teaching scripting languages to 
non-programmers has shown that students often have a problem comprehending where 
the ‘print “please guess”’ statement needs to be placed within the while loop so that it
pertains to both the first iteration and to some of the later iterations. In the lattice this 
relationship is shown more clearly. 

It should be noted that we have not yet tested whether students can read the lattices. 
But we are not suggesting that lattices must be used directly as a software engineering 
tool. The information contained in the lattice could be displayed in a different format, 
which would still need to be determined. We have also not yet determined in how far 
such lattices can be automatically generated from code. We intend to investigate this 
further in the near future. 

The second example, which was taken from Ballentyne (1992) demonstrates the 
equivalence of Warnier diagrams (Orr, 1977) and concept lattices. The original data is
shown in the upper left table in figure 4 and is to be read as follows: action 1 has four crosses
which correspond to the conditions A, ¬B, ¬C or A, ¬B, C or A, B, ¬C or A, B, C.
This is equivalent to condition A implying action 1. Therefore in the formal context on
the left there is a cross for A and 1. Action 2 is conditioned by ¬A and B which is
visible both in the left table and in the formal context. After the whole context has been
constructed in that manner, a second condition is to be considered which is that A and
¬A and so on must exclude each other. This is true for A and B but ¬C does not have
Fig. 4. Warnier diagrams and lattices



any attributes and must be excluded from the lattice. A third condition is that any meet 
irreducible concepts in the lattice must not be labelled by an attribute because otherwise 
that attribute would be implied by other attributes without serving as a condition itself. 
For this reason, the temporary attribute t is inserted. The resulting lattice can be read 
in the same manner as the one in figure 3. The Warnier diagram corresponds to a set of 
paths from the top to the bottom of the lattice which cover all concepts. Obviously, there 
can be different Warnier diagrams corresponding to the same lattice. 

9 Conclusion 

A semiotic conceptual framework for formal languages combines conceptual reasoning 
and inference structures with semiotic modes, such as assertion, question and query. By 
considering the denotations of formal signs as formal concepts, structure is imposed. 







Uta Priss 



Because denotations are both signs and can be mapped to formal concepts which are 
mathematical entities, denotations serve as boundary objects (Star, 1989) between the 
mathematical world and the pragmatic world of signs. The role of contexts is often 
neglected. This is understandable in mathematical applications because mathematical 
entities exist in more global contexts. But in other formal languages, which employ 
richer signs, contexts are frequently changing and cannot be ignored. If contexts are 
modelled as formal concepts, it is not necessary to invent any new structures but instead 
the mechanisms of formal concept analysis can also be applied to contexts. Contexts 
provide a means for managing signs and sign relations. Programming languages and 
databases already fulfill these functions, but so far not much theory has been devel- 
oped which explains the theoretical foundations of context management. A semiotic 
conceptual framework can provide such a theory. 
Abstract. The development of the theory of Formal Concept Analysis
has been accompanied from its beginning by applications of the theory to
real-world problems. Those applications gave rise to the implementation
of the software TOSCANA and the creation of TOSCANA-systems. In this
paper, we provide a mathematical model for these systems. This model,
called Conceptual Data System, enables us to describe TOSCANA-systems
and to discuss possible extensions in mathematical terminology.



1 Introduction 



Since its beginning the development of the mathematical theory of Formal Con- 
cept Analysis has been accompanied not only by theoretical considerations but 
also by practical applications of the theory to real-world problems. During the 
last decade, one particular group of projects contributed greatly to the deployment
of Formal Concept Analysis. TOSCANA-systems, based on the software
TOSCANA, showed how the theory can be used, for instance, for analyzing and
restructuring data or for retrieving documents from a given database. The main tasks
of the software itself are to visualize diagrams, to calculate the realized extents of
the displayed conceptual scales and to allow the user to select and to construct
the view on the data he is interested in. While TOSCANA is for the most part an
implementation of the theoretical concepts of conceptual scaling (cf. [GW89]),
TOSCANA-systems themselves have developed a rich structure which makes them
flexible and adaptable to many different situations. While parts of this structure
were described early on in [VWW91, SSVWW93], a complete
mathematical description of the system and its interface is not available. Our
approach tries to combine the previous work and extend it using some ideas
from the diploma thesis [Ka02] of one of the authors to provide a formal basis for
discussion about the structure, development, and extension of TOSCANA-systems.
We consider a mathematical model for those systems to be helpful to trans- 
port mathematical development to the applied side, and vice-versa to translate 
problems arising in real-world projects back to the theoretic level. 

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 39-46, 2004. 

(c) Springer-Verlag Berlin Heidelberg 2004
2 Many-Valued Context and Conceptual Schema

A TOSCANA-system implements the idea of conceptual scaling [GW89] where a
data table is modeled as a many-valued context:

Definition 1 (many-valued context). A many-valued context is a structure
$\mathbb{K} := (G, M, \bigcup_{m \in M} W_m, I)$, where G is a set of objects, M is a set of attributes,
$W := \bigcup_{m \in M} W_m$ is a set of values and $I \subseteq G \times M \times W$ is a ternary relation,
where $(g, m, w_1), (g, m, w_2) \in I \Rightarrow w_1 = w_2 \in W_m$. Every $m \in M$ can be
understood as a (partial) function from G to W with $m(g) := w$ if $(g, m, w) \in I$.
By $W_m$ we denote the set of potential values for an attribute m, while m(G) is
the set of actual values occurring in the context.
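
To make Definition 1 tangible, the following Python sketch encodes a tiny many-valued context and checks its defining conditions. The data (three documents, two attributes) and the function names `is_many_valued_context`, `value` and `actual_values` are assumptions introduced for this example only.

```python
# Hypothetical data table: objects are rows, attributes are columns.
G = {"doc1", "doc2", "doc3"}
M = {"year", "language"}
W = {"year": {2002, 2003, 2004}, "language": {"en", "de"}}  # potential values W_m
I = {
    ("doc1", "year", 2004),
    ("doc1", "language", "en"),
    ("doc2", "year", 2003),
    ("doc3", "language", "de"),
}


def is_many_valued_context(G, M, W, I):
    """Check that I relates objects, attributes and values from W_m,
    with at most one value per (object, attribute) pair."""
    seen = {}
    for g, m, w in I:
        if g not in G or m not in M or w not in W[m]:
            return False
        if seen.setdefault((g, m), w) != w:
            return False
    return True


def value(m, g):
    """The partial function m(g); returns None where m is undefined for g."""
    return next((w for (g2, m2, w) in I if (g2, m2) == (g, m)), None)


def actual_values(m):
    """m(G): the values that actually occur for attribute m."""
    return {w for (_, m2, w) in I if m2 == m}


print(is_many_valued_context(G, M, W, I))
print(value("year", "doc1"), value("year", "doc3"))
print(actual_values("year"))
```

Here the potential values for `year` include 2002, although only 2003 and 2004 actually occur, which mirrors the distinction between $W_m$ and m(G).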

A conceptual scale for a set of attributes of a many-valued context is defined 
as follows: 

Definition 2 (conceptual scale). Let $\mathbb{K} := (G, M, W, I)$ be a many-valued
context and $N \subseteq M$. Then we call a formal context $\mathbb{S}_N := (G_N, M_N, I_N)$ conceptual
scale if $\{(m(g))_{m \in N} \mid g \in G\} \subseteq G_N \subseteq \prod_{m \in N} W_m$. We say $\mathbb{S}_N$ scales
the attribute set N. A family of conceptual scales $(\mathbb{S}_{N_j})_{j \in J}$ scales $\mathbb{K}$ if every
conceptual scale $\mathbb{S}_{N_j}$ scales the attribute set $N_j \subseteq M$.
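
The condition in Definition 2 amounts to two subset checks, which the following Python sketch carries out for a hypothetical scale on a single attribute. The data, the scale attributes and the function name `scales` are assumptions of this example; for simplicity every attribute in N is assumed to be defined for every object.

```python
from itertools import product

# Hypothetical many-valued context with two attributes.
W = {"year": {2002, 2003, 2004}, "language": {"en", "de"}}
rows = {  # object -> attribute -> value
    "doc1": {"year": 2004, "language": "en"},
    "doc2": {"year": 2003, "language": "de"},
}


def scales(scale_objects, N, rows, W):
    """Check: realised value tuples <= G_N <= product of the W_m for m in N."""
    realised = {tuple(row[m] for m in N) for row in rows.values()}
    potential = set(product(*(W[m] for m in N)))
    return realised <= scale_objects <= potential


N = ("year",)                      # the scaled attribute set, as an ordered tuple
G_N = {(2002,), (2003,), (2004,)}  # object set of a hypothetical scale
M_N = {"<= 2003", ">= 2003"}       # scale attributes
I_N = {((2002,), "<= 2003"), ((2003,), "<= 2003"),
       ((2003,), ">= 2003"), ((2004,), ">= 2003")}

print(scales(G_N, N, rows, W))
```

Note that the objects of the scale are tuples of potential values, not the objects of the many-valued context itself.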

Central to all TOSCANA-systems are the diagrams that are used as interface to
the data. To the user, the conceptual scale and the corresponding diagram appear 
as one entity. Mathematically however, we differentiate between the already 
introduced conceptual scale and its geometrical representation. 

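Definition 3 (diagram map). If $(P, \leq)$ is an ordered set, P is finite, and $\prec$
denotes the lower neighbour relation for $\leq$, we call an injective mapping

$\lambda : P \cup \prec \; \rightarrow \; \mathbb{R}^2 \cup \mathfrak{P}(\mathbb{R}^2)$

diagram map if

• $p \in P \Rightarrow \lambda(p) \in \mathbb{R}^2$,

• $p_1 < p_2 \Rightarrow \lambda(p_1)|_2 < \lambda(p_2)|_2$, and

• $(p_1, p_2) \in \prec \; \Rightarrow \lambda(p_1, p_2) := \{r \cdot \lambda(p_1) + (1 - r) \lambda(p_2) \mid r \in \mathbb{R} \text{ and } 0 \leq r \leq 1\}$.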
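
The conditions on a diagram map can be checked mechanically for a small ordered set. The Python sketch below uses a hypothetical four-element diamond; the coordinates and the function names `strictly_below`, `respects_order` and `segment_point` are assumptions of this example. It checks injectivity on the points and the condition on the second coordinate, and it computes points on a line segment, but it does not verify injectivity on the segments.

```python
# Hypothetical ordered set (a diamond) given by its lower neighbour relation:
# (p, q) means that p is a lower neighbour of q.
P = {"bottom", "left", "right", "top"}
lower_neighbours = {("bottom", "left"), ("bottom", "right"),
                    ("left", "top"), ("right", "top")}
position = {"bottom": (0.0, 0.0), "left": (-1.0, 1.0),
            "right": (1.0, 1.0), "top": (0.0, 2.0)}


def strictly_below(p, q):
    """p < q in the order generated by the lower neighbour relation."""
    frontier = {b for (a, b) in lower_neighbours if a == p}
    seen = set()
    while frontier:
        x = frontier.pop()
        if x == q:
            return True
        if x not in seen:
            seen.add(x)
            frontier |= {b for (a, b) in lower_neighbours if a == x}
    return False


def respects_order(position):
    """Injective on points, and p < q implies a smaller second coordinate."""
    injective = len(set(position.values())) == len(position)
    monotone = all(position[p][1] < position[q][1]
                   for p in P for q in P if strictly_below(p, q))
    return injective and monotone


def segment_point(p, q, r):
    """The point r*pos(p) + (1 - r)*pos(q) on the line segment, 0 <= r <= 1."""
    (x1, y1), (x2, y2) = position[p], position[q]
    return (r * x1 + (1 - r) * x2, r * y1 + (1 - r) * y2)


print(respects_order(position))
print(segment_point("bottom", "left", 0.5))
```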

The image $\lambda(P \cup \prec)$ of a diagram map represents a line diagram of the
ordered set $(P, \leq)$. The image $\lambda(P)$ is the set of all points and $\lambda(\prec)$ is the set
of all line segments of the line diagram. If $P = \mathfrak{B}(\mathbb{K})$, we can assign a labeling
$(G_c, M_c)_{c \in \mathfrak{B}(\mathbb{K})}$ where $(G_c, M_c) \in \mathfrak{P}(G_{\mathbb{K}}) \times \mathfrak{P}(M_{\mathbb{K}})$ to a diagram map $\lambda$. The
labels can be attached to the corresponding points $v \in \lambda(\mathfrak{B}(\mathbb{K}))$ using $\lambda$. The
label $(G_c, M_c)$ is attached to the point $\lambda(c)$. A simple way to assign a labeling
to $\lambda$ is $(c)_{c \in \mathfrak{B}(\mathbb{K})}$. In TOSCANA-systems it is common to label attributes reduced,
i.e. to list only contingents. With $\gamma$ we refer to the object concept mapping from
the Basic Theorem on Formal Concept Analysis and with $\mu$ to the attribute
concept mapping. Then we define

$(\mathrm{Ext}(c), \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{K})}$

as complete labeling. The reduced labeling is defined as

$(\gamma^{-1}(c), \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{K})}$.

The process of developing a TOSCANA-system is an iterative interdisciplinary
task in which a discussion between domain experts and Formal Concept Analysis
experts yields conceptual scales for a database of interest. The result of this
process is a conceptual schema, which will be defined mathematically in the
following.

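Definition 4 (conceptual schema). Let $(\mathbb{S}_{N_j})_{j \in J}$ be a family of conceptual
scales and let $(\lambda_j)_{j \in J}$ be a family of diagram maps where $\mathrm{dom}(\lambda_j) = \mathfrak{B}(\mathbb{S}_{N_j}) \cup {\prec_j}$.
Then we call the vector $\mathcal{S} := (\mathbb{S}_{N_j}, \lambda_j)_{j \in J}$ conceptual schema. We say that a
conceptual schema $\mathcal{S}$ and a many-valued context $\mathbb{K}$ are consistent if $(\mathbb{S}_{N_j})_{j \in J}$
scales $\mathbb{K}$.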

Our formalization of the conceptual schema adapts the conceptual file model
from [VWW91] (the term was changed to conceptual schema since the information
is not necessarily stored in a single file). The information necessary to
connect the database is modelled very subtly by the sets $N_j$. This reflects
the fact that in TOSCANA-systems one cannot change the column names in a
data table without adapting the queries for the diagrams. Note that the object
set of a conceptual scale is contained in $\times_{m \in N_j} W_m$ according to Definition 2.
Therefore the labeling for a diagram map $\lambda_j$ contains in the object part tuples
of potential attribute values.

With the preceding definitions we have formalized the basic ingredients of a 
conceptual data system. Next, we will discuss a description of the interface. 

3 Conceptual Interface and Conceptual Data System 

With the notion of a conceptual interface we will construct a mathematization of
the interface of the system, which is usually provided by the TOSCANA-software.
The TOSCANA-software uses the information from the real-world counterpart of
a conceptual schema to present the data to the user, and it allows the user to
interact with the system. A conceptual interface, a conceptual schema, and a
many-valued context will form the components of a conceptual data system. The
following definitions will be put together in the end. We start by describing the
connection between many-valued context and conceptual scales.

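Definition 5 (realized scale). Let $\mathcal{S} := (\mathbb{S}_{N_j}, \lambda_j)_{j \in J}$ be a conceptual schema
consistent with $\mathbb{K} := (G, M, W, I)$ and let $j \in J$. For $\mathbb{S}_{N_j} := (G_{N_j}, M_{N_j}, I_{N_j})$
the realized scale is defined as $\mathbb{S}^r_{N_j} := (G, M_{N_j}, I^r_{N_j})$ with $(g, m) \in I^r_{N_j} :\Leftrightarrow
((n(g))_{n \in N_j}, m) \in I_{N_j}$.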
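
To make Definition 5 concrete, the following minimal Python sketch computes the incidence relation of a realized scale and lists the scale objects that no database object realizes. It is only an illustration under stated assumptions: the function names, the toy data, and the simplification that every attribute value is defined (the paper allows partial functions) are ours and not part of the formalization.

```python
def realized_incidence(values, N, scale_attributes, scale_incidence):
    """Incidence of the realized scale.

    values[g][n] is the value of many-valued attribute n for object g;
    scale_incidence is a set of (h, m) pairs, where h is a value tuple ordered as N.
    """
    realized = set()
    for g, row in values.items():
        h = tuple(row[n] for n in N)  # the scale object that g is mapped to
        for m in scale_attributes:
            if (h, m) in scale_incidence:
                realized.add((g, m))
    return realized


def non_realized_objects(values, N, scale_objects):
    """Scale objects h for which no object g has the value combination h."""
    hit = {tuple(row[n] for n in N) for row in values.values()}
    return set(scale_objects) - hit


# Hypothetical toy data: one many-valued attribute and a two-attribute scale.
values = {"p1": {"age": 25}, "p2": {"age": 70}}
N = ("age",)
scale_objects = {(25,), (40,), (70,)}
scale_attributes = {"young", "old"}
scale_incidence = {((25,), "young"), ((40,), "young"), ((70,), "old")}
```

In this toy example the scale object `(40,)` is prescribed by the scale but not met by any object, so it is non-realized.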

It is important to note that an object $h \in G_{N_j}$ can be non-realized if there
is no $g \in G$ meeting the attribute value combination prescribed by $h$. The
object clarification of a realized scale $\mathbb{S}^r_{N_j}$ is isomorphic to the subcontext

$\tilde{\mathbb{S}}_{N_j} := (\tilde{G}_{N_j}, M_{N_j}, I_{N_j} \cap (\tilde{G}_{N_j} \times M_{N_j}))$ of $\mathbb{S}_{N_j}$ with

$\tilde{G}_{N_j} := \{h \in G_{N_j} \mid \exists g \in G : (m(g))_{m \in N_j} = h\}$.

Therefore, we can embed the lattice $\mathfrak{B}(\mathbb{S}^r)$ in $\mathfrak{B}(\mathbb{S})$, identifying concepts by their
intents, using the following theorem from [GW99, p. 98]:

Theorem 6. For $H \subseteq G$, the map

$\beta : \mathfrak{B}(H, M, I \cap H \times M) \to \mathfrak{B}(G, M, I)$

$(A, B) \mapsto (B^I, B)$

is a $\bigvee$-preserving order-embedding.

If this embedding $\beta$ is not surjective, non-realized concepts will occur. These
are concepts of $\mathbb{S}$ which have no preimage under the natural embedding $\beta$. Let
$c := (A, B)$ be in $\mathfrak{B}(\mathbb{S})$. One can check whether $c$ is non-realized by deriving $B^{I^r}$.
If $(B^{I^r}, B)$ is not an element of $\mathfrak{B}(\mathbb{S}^r)$, the concept $c$ is non-realized.

Definition 7 (non-realized concept). We call a concept $c \in \mathfrak{B}(\mathbb{S})$ non-realized
in $\mathfrak{B}(\mathbb{S}^r)$ if $c \notin \beta(\mathfrak{B}(\mathbb{S}^r))$.

If a conceptual scale $\mathbb{S}_N$ scales an attribute set $N$ of a many-valued context
$\mathbb{K}$ and for every $h \in G_N$ there exists a $g \in G$ with $(m(g))_{m \in N} = h$, we call $\mathbb{S}^r$
completely realized. Discussions of non-realized concepts and their relevance for
local scaling can be found in [St96, Sc98]. If a conceptual schema $\mathcal{S}$ and a
many-valued context $\mathbb{K}$ are consistent we can define additional notions of labelings. Let
$\mathbb{S}_{N_j} := (G_{N_j}, M_{N_j}, I_{N_j})$ be a scale of a conceptual schema $\mathcal{S}$. Realized labelings
for the corresponding diagram map $\lambda_j : \mathfrak{B}(\mathbb{S}_{N_j}) \cup {\prec} \to \mathbb{R}^2 \cup \mathfrak{P}(\mathbb{R}^2)$ are subsets
of $\mathfrak{P}(G) \times \mathfrak{P}(M_{N_j})$. The complete realized labeling is defined as

$((\mathrm{Int}(c))^{I^r_{N_j}}, \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{S}_{N_j})}$

and the reduced realized labeling is defined as

$((\mathrm{Int}(c))^{I^r_{N_j}} \setminus \bigcup_{c' < c} (\mathrm{Int}(c'))^{I^r_{N_j}}, \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{S}_{N_j})}$.



3.1 Zooming 

In a TOSCANA-system, the user can select a subset of the available objects for
further analysis by double-clicking on a given concept. This is called zooming.
More precisely, the user changes the state of the system into another state,
enabling the system to produce a new diagram. The state of a TOSCANA-system
is given by the selected scales and the chosen object sets for filtering. We formally
define:

Definition 8 (state, initial state). Let $\mathcal{S} := (\mathbb{S}_{N_j}, \lambda_j)_{j \in J}$ be a conceptual
schema consistent with a many-valued context $\mathbb{K} := (G, M, W, I)$. Then a state
is a triple $s := (\sigma, F_1, F_2)$, where $F_1 \subseteq F_2 \subseteq G$ and $\sigma := (j_i)_{i=1}^{n}$ with $j_i \in J$ for
$i \in \{1, \ldots, n\}$ and $i \neq k \Rightarrow j_i \neq j_k$. $F_1$ is called exact zooming filter and $F_2$ is
called full zooming filter. An initial state is a state where $F_1 = F_2 = G$.
The scales identified by $\sigma$ are those that are displayed to the user. The number
of these active scales coincides with the depth of nesting $n$. To allow the user to
switch between exact and full zooming filter, we maintain both sets, $F_1$ and $F_2$.

Formally, zooming maps states to states, depending on the user input which 
consists of a concept and a next diagram to zoom into. In a real application this 
can be a mouse click on a node of the line diagram displayed and a choice of a 
certain number of diagrams of interest, for instance the next diagram to zoom 
into. 

We need to recall a definition from [GW99]: 

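Definition 9. For two formal contexts $\mathbb{K}_i := (G_i, M_i, I_i)$ with $G_1 = G_2$, we call
$\mathbb{K}_1 | \mathbb{K}_2 := (G_1, M_1 \cup M_2, I_1 \cup I_2)$ the apposition of $\mathbb{K}_1$ and $\mathbb{K}_2$.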

Definition 10. We define zooming as the mapping

$\zeta : (s_1, c, j) \mapsto s_2$

where $s_1 := (\sigma, F_1, F_2)$ and $s_2 := (\sigma', F_1', F_2')$ are states, $j \in J$, and
$c \in \mathfrak{B}(\ldots)$. Then

$F_1' := F_1 \cap \gamma^{-1}(c)$,

and

$F_2' := F_2 \cap \mathrm{Ext}(c)$.

For $\sigma := (j_i)_{i=1}^{n}$, the new set $\sigma'$ of active scales is $\sigma' := (j_i)_{i=2}^{n+1}$ where $j_{n+1} = j$.
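
The zooming step of Definition 10 can be sketched in a few lines of Python. This is a minimal illustration, not the TOSCANA implementation: the `State` type, the function `zoom`, and the toy objects are hypothetical names, and the object contingent and the extent of the selected concept are assumed to be given as plain sets.

```python
from typing import FrozenSet, NamedTuple, Tuple


class State(NamedTuple):
    sigma: Tuple[str, ...]   # indices of the active scales
    f1: FrozenSet[str]       # exact zooming filter
    f2: FrozenSet[str]       # full zooming filter


def zoom(s: State, contingent: FrozenSet[str], extent: FrozenSet[str], j: str) -> State:
    """Drop the first active scale, append scale j, and narrow both filters."""
    return State(s.sigma[1:] + (j,), s.f1 & contingent, s.f2 & extent)


# Toy example: three objects, one active scale, zooming into a concept whose
# extent is {g1, g2} and whose object contingent is {g1}.
G = frozenset({"g1", "g2", "g3"})
initial = State(("scaleA",), G, G)
zoomed = zoom(initial, frozenset({"g1"}), frozenset({"g1", "g2"}), "scaleB")
```

Because the contingent of a concept is contained in its extent, the exact filter stays a subset of the full filter after every zooming step.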

We call a state valid if it is an initial state or the result of the zooming 
operation with a valid state as input. After having formalized zooming, we turn 
to the more sophisticated (nested) diagram display. 

3.2 Diagram Display 

An important aspect of TOSCANA-systems is the ability to combine several
predefined diagrams into one more complex view onto the data by nested line
diagrams. To describe this mathematically, we introduce an operation between
diagram maps:

Definition 11 ($\oplus$). Let $(P_1, \le_1)$ and $(P_2, \le_2)$ be finite ordered sets with lower
neighbour relations $\prec_1$ and $\prec_2$; furthermore let $\prec_{12}$ be the lower neighbour
relation of the direct product $(P_1, \le_1) \times (P_2, \le_2)$. For a diagram map $\lambda_1$ of $(P_1, \le_1)$
and a diagram map $\lambda_2$ of $(P_2, \le_2)$, positive reals $s$, $r_1$, and $r_2$ can be chosen such
that $r_1 \min\{\|v - w\| \mid v, w \in \lambda_1(P_1), v \neq w\} > s$ and $r_2 \max\{\|v\| \mid v \in \lambda_2(P_2)\} < s$. For
such real numbers, a diagram map $\lambda := \lambda_1 \oplus \lambda_2 : P_1 \times P_2 \cup {\prec_{12}} \to \mathbb{R}^2 \cup \mathfrak{P}(\mathbb{R}^2)$
exists with

$\lambda(p_1, p_2) := r_1\lambda_1(p_1) + r_2\lambda_2(p_2)$ and
$\lambda((p_1, p_2), (q_1, q_2)) := \{r\lambda(p_1, p_2) + (1 - r)\lambda(q_1, q_2) \mid r \in \mathbb{R} \text{ with } 0 \le r \le 1\}$.



Proposition 12. The class of diagram maps is closed under the operation $\oplus$
defined in Definition 11.







Joachim Hereth Correia and Tim B. Kaiser 



Proof. Let $\lambda := \lambda_1 \oplus \lambda_2$. We have to show

$(p_1, p_2) < (q_1, q_2) \Rightarrow \lambda(p_1, p_2)|_2 < \lambda(q_1, q_2)|_2$.

Assume $(p_1, p_2) < (q_1, q_2)$. Then

$\lambda(p_1, p_2)|_2 = (r_1\lambda_1(p_1) + r_2\lambda_2(p_2))|_2 = r_1\lambda_1(p_1)|_2 + r_2\lambda_2(p_2)|_2$
$< r_1\lambda_1(q_1)|_2 + r_2\lambda_2(q_2)|_2 = (r_1\lambda_1(q_1) + r_2\lambda_2(q_2))|_2 = \lambda(q_1, q_2)|_2$.

It remains to show that $\lambda$ is injective. If $\lambda(p_1, p_2) = \lambda(q_1, q_2)$, then

$r_1\lambda_1(p_1) + r_2\lambda_2(p_2) = r_1\lambda_1(q_1) + r_2\lambda_2(q_2)$

which can be transformed to

$r_1(\lambda_1(p_1) - \lambda_1(q_1)) = r_2(\lambda_2(q_2) - \lambda_2(p_2))$.

This implies

$r_1\|\lambda_1(p_1) - \lambda_1(q_1)\| = r_2\|\lambda_2(p_2) - \lambda_2(q_2)\|$.

If $p_1 \neq q_1$, the injectivity of $\lambda_1$ and the choice of $r_1$ force the left-hand side of
the above equation to be greater than $s$. Therefore $p_2 \neq q_2$, because otherwise the
right-hand side would be 0. The choice of $r_2$ implies that the right-hand side of
the above equation is less than or equal to $s$. It follows that $p_1 = q_1$, which implies
$p_2 = q_2$, because $\lambda_2$ is injective. □
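
The combination of two diagram maps from Definition 11 can be illustrated with a small Python sketch that places a scaled copy of the inner diagram at every point of the outer diagram. The function name `combine` and the toy coordinates are hypothetical. As an assumption of this sketch (not taken from the paper), the inner diagram is shrunk so that its diameter stays below half the smallest distance between outer points, which is a stricter and therefore safe choice of the scaling factors.

```python
import math
from itertools import product


def combine(lam1, lam2):
    """Map each pair (p1, p2) to r1 * lam1[p1] + r2 * lam2[p2].

    lam1 and lam2 are dicts from element names to (x, y) coordinates.
    """
    pts1 = list(lam1.values())
    # Smallest distance between two distinct points of the outer diagram.
    dmin = min(
        (math.dist(v, w) for i, v in enumerate(pts1) for w in pts1[i + 1:]),
        default=1.0,
    )
    # Largest norm of a point of the inner diagram.
    nmax = max(math.hypot(x, y) for x, y in lam2.values())
    r1 = 1.0
    # Sketch assumption: the inner diameter (at most 2 * r2 * nmax) is dmin / 2.
    r2 = dmin / (4.0 * nmax) if nmax > 0 else 1.0
    return {
        (p1, p2): (r1 * x1 + r2 * x2, r1 * y1 + r2 * y2)
        for (p1, (x1, y1)), (p2, (x2, y2)) in product(lam1.items(), lam2.items())
    }


# Toy diagrams: a two-element chain (outer) and a four-element diamond (inner).
lam1 = {"top": (0, 2), "bot": (0, 0)}
lam2 = {"t": (0, 1), "l": (-1, 0), "r": (1, 0), "b": (0, -1)}
combined = combine(lam1, lam2)
```

With these coordinates all eight combined points are distinct, and comparable pairs keep their vertical order, as required of a diagram map.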

From the viewpoint of Formal Concept Analysis a TOSCANA-system visualizes
the concept lattice of the apposition of the participating realized scales
embedded into the direct product of the concept lattices of the corresponding
conceptual scales. Because the object set of different realized scales of the same
conceptual schema is equal, we maintain the object set and form the disjoint
union of attributes and incidence relations for combining two scales. Proposition
31 and Theorem 7 from [GW99, p. 98, p. 77] guarantee that an order embedding
of $\mathfrak{B}(\mathbb{S}_1 | \mathbb{S}_2)$ into $\mathfrak{B}(\mathbb{S}_1) \times \mathfrak{B}(\mathbb{S}_2)$ is always possible. After summarizing the
theoretical background, we give the definition for a diagram display.

Definition 13 (diagram display). Let $\mathcal{S}$ be a conceptual schema consistent
with a many-valued context $\mathbb{K}$ and $s$ be a valid state of the former. A diagram
display is a mapping

$\delta : (\mathbb{K}, \mathcal{S}, s) \mapsto \lambda$

with $\lambda := \bigoplus_{i=1}^{n} \lambda_{j_i}$.

We have to explain how a diagram map resulting from the operation $\oplus$ can be
labeled. Let $\lambda_1$ and $\lambda_2$ be diagram maps with $P_1 := \mathfrak{B}(\mathbb{S}_1)$ and $P_2 := \mathfrak{B}(\mathbb{S}_2)$.
For $c = (c_1, c_2) \in \mathfrak{B}(\mathbb{S}_1) \times \mathfrak{B}(\mathbb{S}_2)$ we abbreviate $\mathrm{Int}(c_1) \cup \mathrm{Int}(c_2)$ by writing $\mathrm{Int}(c)$
and $\mu^{-1}(c_1) \cup \mu^{-1}(c_2)$ by writing $\mu^{-1}(c)$. Let $s := (\sigma, F_1, F_2)$ be a valid state
of the system. Then we can distinguish four different labelings. In the following
$I^r_{1,2}$ denotes the incidence relation of the formal context $\mathbb{S}^r_1 | \mathbb{S}^r_2$. The complete
s-realized labeling is defined as

$(F_2 \cap \mathrm{Int}(c)^{I^r_{1,2}}, \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{S}_1) \times \mathfrak{B}(\mathbb{S}_2)}$.

This labeling corresponds to the setting all documents in TOSCANA 2 and
TOSCANA 3 and to the Show all matches - Filter: use all matches setting in
ToscanaJ. If we use a reduced object labeling, we get the special s-realized
labeling

$(F_2 \cap (\mathrm{Int}(c)^{I^r_{1,2}} \setminus \bigcup_{c' < c} \mathrm{Int}(c')^{I^r_{1,2}}), \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{S}_1) \times \mathfrak{B}(\mathbb{S}_2)}$.

This labeling corresponds to the setting special documents in TOSCANA 3 and
to Show only exact matches - Filter: use all matches in ToscanaJ. If we now
apply the smaller filter $F_1$ we get an exact s-realized labeling, which corresponds
to the setting exact documents in TOSCANA 3 and to Show only exact matches
- Filter: use only exact matches in ToscanaJ:

$(F_1 \cap (\mathrm{Int}(c)^{I^r_{1,2}} \setminus \bigcup_{c' < c} \mathrm{Int}(c')^{I^r_{1,2}}), \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{S}_1) \times \mathfrak{B}(\mathbb{S}_2)}$.

The fourth combinatorial possibility is seldom used in TOSCANA systems,
but we mention it here for the sake of completeness:

$(F_1 \cap \mathrm{Int}(c)^{I^r_{1,2}}, \mu^{-1}(c))_{c \in \mathfrak{B}(\mathbb{S}_1) \times \mathfrak{B}(\mathbb{S}_2)}$.

Constructing the object parts of the labeling by deriving the concept intents
via the relation $I^r_{1,2}$, we obtain an order embedding of the concept lattice of the
apposition of the realized scales into the direct product of the concept lattices
of the conceptual scales. The derivation operator $\cdot^{I^r_{1,2}} : \mathfrak{P}(M_1 \cup M_2) \to \mathfrak{P}(G)$
models the database queries.

3.3 Final Formalizations 

Now we can give a compact definition of what can be understood as a conceptual 
interface: 

Definition 14 (conceptual interface). Let $\mathbb{K}$ be a many-valued context
consistent with the conceptual schema $\mathcal{S}$. A conceptual interface is a pair $\mathcal{I} :=
(\mathcal{J}, \Phi)$ such that for every subset $L$ of the index set $J$ of the conceptual schema
$\mathcal{S}$, the operator $\mathcal{J}_L$ is the derivation operator for the formal context [...] and
the set $\Phi$ contains at least the essential mappings $\delta$ for diagram display and $\zeta$
for the zooming operation.

The next definition puts all the parts together. 

Definition 15 (conceptual data system). Let $\mathbb{K}$ be a many-valued context
consistent with the conceptual schema $\mathcal{S}$ and $\mathcal{I}$ a conceptual interface. Then
$\mathcal{CDS} := (\mathbb{K}, \mathcal{S}, \mathcal{I})$ is called a conceptual data system.

4 Conclusion

We have formalized the core part of all existing TOSCANA-systems by the
notion of a conceptual data system. With this formalization we provide means
for describing the interface of a TOSCANA-system formally as well as its
interaction with the conceptual schema and the database. We give mathematical
descriptions for the different labeling methods and summarize the mathematical
theorems necessary to illuminate the conceptual meaning of the diagrams.
This framework may help in investigating and discussing different extensions
for TOSCANA-systems mathematically, for instance in the course of the ToscanaJ
project (cf. [BH03]).

Abstract. BLID (Bio-Logical Intelligent Database) is a bioinformatic
system designed to help biologists extract new knowledge from raw 
genome data by providing high-level facilities for both data browsing 
and analysis. We describe BLID’s novel data browsing system which is 
based on the idea of Logical Information Systems. This enables combined 
querying and navigation of data in BLID (extracted from public bioinfor- 
matic repositories). The browsing language is a logic especially designed 
for bioinformatics. It currently includes sequence motifs, taxonomies, and 
macromolecule structures, and it is designed to be easily extensible, as 
it is composed of reusable components. Navigation is tightly combined 
with this logic, and assists users in browsing a genome through a form 
of human-computer dialog. 



1 Motivation 

Over the last decade many organisms have had their genomes fully sequenced.
For example, the 17 chromosomes of the Baker's Yeast (Saccharomyces
cerevisiae) have been sequenced, and they code for about 6000 proteins [Gof97].
Yeast is one of the best studied of all organisms, yet about 30% of all its
proteins still have no known function. For other organisms the percentage
is higher. Therefore, one of the most important current problems in biology is to
discover the function of these proteins that are currently unknown, and to better
understand the function of those that are putatively known. To help do this,
biologists require new and powerful tools to browse and compare bioinformatic
databases, and so extract the wealth of information hidden in them.

Many bioinformatic databases are publicly available: e.g., the whole genome 
of the Yeast is accessible from MIPS 1 . Also, many tools are available: e.g., PSI- 
BLAST for comparing sequences, ExPASy for computing physical properties of 
proteins. However, these data sources and analysis tools are disconnected from 
each other, making it very difficult to perform genome-wide analysis. Moreover, 
they usually offer limited forms of querying and navigation. 

* This project is funded by the BBSRC grant 21BEP17028. 

1 Munich Information Center for Protein Sequences, http://mips.gsf.de/

Our aim is to provide biologists with a high-level and integrated interface
for browsing and analyzing a whole genome. To do this we first must build a 
secondary database gathering data from different sources, and represent them 
in a uniform way. We then must define a querying language that fits the needs 
of bioinformatics, and allows browsing capabilities. In this paper, we focus on 
the second task. This language can deal with taxonomies of protein functions, 
with complex sequence patterns (as in Prosite), and with structures (e.g., the 
transcription of proteins from several RNA parts called exons). The need for 
complex representations and reasoning mechanisms leads us to the use of log- 
ics specialized to bioinformatics. Hence the name of our system, BLID, which 
stands for Bio-Logical Intelligent Database. The term “intelligent” refers to the 
automated analysis, such as machine learning or data-mining, that will be made 
available on top of the querying system in the future. 

Section 2 discusses the use of Formal Concept Analysis (FCA) for bio-logical
browsing, and presents Logical Information Systems (LIS) as a theoretical
framework for BLID. Section 3 presents a logic for the representation and reasoning
of descriptions and queries. Section 4 explains and illustrates how an automatic 
and non-hierarchical navigation can be combined with logical querying. Finally, 
Section 5 discusses related works, and Section 6 draws some future directions, 
especially w.r.t. analyses. 



2 Concept Analysis and Logical Information Systems 

Formal Concept Analysis (FCA) is a mathematical theory based on ordered
sets and complete lattices [Wil82]. A context is a triple $(\mathcal{O}, 2^{\mathcal{A}}, d)$, where $\mathcal{O}$ is
a set of objects, $\mathcal{A}$ a set of attributes, and $d$ is a mapping from objects to their
description, i.e. a set of attributes. Then, a Galois connection $(ext, int)$ is defined
between sets of objects and sets of attributes. For every set of attributes $A \subseteq \mathcal{A}$, its
extent $ext(A) = \{o \in \mathcal{O} \mid d(o) \supseteq A\}$ is defined as the set of objects whose
description contains $A$ (i.e., the answers of $A$, when $A$ is seen as a query); and
for every set of objects $O \subseteq \mathcal{O}$, its intent $int(O) = \bigcap_{o \in O} d(o)$ is defined as the set
of attributes shared by all objects in $O$. Pairs of related extent and intent,
such as $(ext(A), int(ext(A)))$ or $(ext(int(O)), int(O))$, are called concepts, and
form together a complete lattice of concepts when ordered by set inclusion on
their extent (or equivalently on their intent). Numerous works have shown the
usefulness of this concept lattice for information retrieval combining querying
and navigation [GMA93,FR03], learning and data-mining [GK00,FR02b].
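
As an illustration of the Galois connection, the following minimal Python sketch computes extents, intents, and the concept generated by a query on a toy context. The object and attribute names are invented for the example, and the function is called `intent` only to avoid shadowing Python's built-in `int`.

```python
from typing import Dict, FrozenSet, Set, Tuple

# Toy context: each object is mapped to its description (a set of attributes).
d: Dict[str, FrozenSet[str]] = {
    "orf1": frozenset({"membrane", "kinase"}),
    "orf2": frozenset({"membrane"}),
    "orf3": frozenset({"kinase", "nuclear"}),
}


def ext(attrs: Set[str]) -> FrozenSet[str]:
    """Objects whose description contains every attribute in attrs."""
    return frozenset(o for o, desc in d.items() if attrs <= desc)


def intent(objs: Set[str]) -> FrozenSet[str]:
    """Attributes shared by all objects in objs (all attributes if objs is empty)."""
    result = frozenset().union(*d.values())
    for o in objs:
        result = result & d[o]
    return result


def concept_from_query(attrs: Set[str]) -> Tuple[FrozenSet[str], FrozenSet[str]]:
    """The concept (ext(A), int(ext(A))) generated by a set of attributes A."""
    e = ext(attrs)
    return e, intent(e)
```

For instance, querying with `{"nuclear"}` yields the concept whose extent is `{"orf3"}` and whose intent also contains `"kinase"`, because every nuclear object in this toy context is a kinase.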

This applicability of FCA to information retrieval and learning is the basis 
for our choice of its use as a theoretical foundation. In BLID, objects are the 
ORFs 2 of some organism (the Yeast in the rest of this paper). However, simple 
sets of attributes are not an expressive enough language for object descriptions 
and queries. For example, a protein sequence cannot be made an attribute as it

2 ORFs (Open Reading Frames) are segments of DNA in chromosomes supposed to be
transcribed and translated into proteins. An ORF coincides with the coding region
of a gene when this protein has directly been observed.

is different for each gene. A set of predefined patterns could be used instead (e.g.,
from Prosite), but information about the gene would be lost, and querying with
new patterns would no longer be possible. We wish to preserve all information
and to have an open query language at our disposal.

Logical Concept Analysis (LCA, [FR03]) is an extension of FCA that allows
for the replacement of sets of attributes $A \in 2^{\mathcal{A}}$ by formulas $f \in \mathcal{L}$ of a logic. The
formulas need only be ordered by a subsumption relation $\sqsubseteq$, and this must form a
lattice. Logical Information Systems (LIS, [FR03]) are founded on LCA and are
characterized by: (a) an object-centered representation; (b) a tight combination
of querying and navigation; (c) a logical representation of object descriptions,
queries, and navigation links; (d) genericity in the logic for customization.

Section 3 describes the building of a logic that is designed specifically for
BLID. This comprises the definition of the language, as well as a few necessary
operations: (1) the subsumption $\sqsubseteq$ for ordering formulas according to their
specificity/generality 3 ($f \sqsubseteq g$ means $f$ is more specific than $g$), (2) the
conjunction $\sqcap$, and (3) the function feat that maps each object description to a set of
more general formulas, their features. These features play an important role in
navigation, which is presented in Section 4. They are generated mainly
automatically by the operation feat, but can also be introduced at any time manually by
users according to their needs.

3 Customized Logic and Querying for Bioinformatics 

A logical context is built by creating an object for every ORF of the Yeast's
genome. Each object/ORF is described by collecting information from various
data sources (see Section 1). For instance, a partial description of the
ORF YAL003w, incorporating various data types, is (sequences are shortened):

[ name is "YAL003w", nb_atoms = 3138, mol_weight = 22.627e3,
seq is MAS[..]QKL, struc is c(8)a(12)c(3)[..]b(10)c(5)a(25)c,
some exon is [1,80], some exon is [447,987],
'mfc05/04/02: elongation' ].

This description combines different concrete domains: text (name), integer 
(number of atoms), float (molecular weight), two kinds of sequences (over amino- 
acids and 3D structures), segments (exons). The first sequence (attribute seq) is 
made of amino-acids, and defines the protein expressed by the ORF. In solution
the protein folds to form a specific 3D shape. This shape, the tertiary structure,
is still generally unknown, but it is possible to reliably predict an intermediate
structure, the secondary structure [OK00]. This latter structure (attribute
struc) is represented as a sequence composed of 3 kinds of structure element 
(helices a, sheets b, and connecting elements called coils c), which can have dif- 
ferent lengths (given between brackets after each structure element). The exons 

3 Notice that the left argument in the subsumption relation is not restricted to be an
object description, but can be any query as well as the right argument. This makes
it much harder to define such logics, but is necessary for navigation.

are the gene segments, from which an ORF is composed. The last term in the
description is an element of the taxonomy of functional classes, found in MIPS.
(This taxonomy has been used as target classes in machine learning [KKCD00].)
Let the following expression be a query:

('mfc05: PROTEIN SYNTHESIS' or 'mfc06: PROTEIN FATE')
and some exon start >= 2 and nb_atoms in 3000..4000 and
seq match N-{P}-[ST]-{P} and not name ends with "w",

where the pattern N-{P}-[ST]-{P} is the Prosite motif PS00001, which is
described as "N-glycosylation site". While none of the terms of this query appears
as such in the description of YAL003w, the latter is still an answer of the query.
This means that propositional logic is not expressive enough as a query
language w.r.t. the above description. One could wonder if propositional logic could be
made suitable by adapting the object descriptions. The answer is no, because 
some data types have an infinite number of patterns (e.g., numerical intervals, 
sequence motifs); and even for data types where this adaptation is possible (e.g., 
finite taxonomy of functional classes), this would imply redundancy and poten- 
tial exponential growth of the descriptions. 
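
To illustrate why a sequence motif cannot be reduced to a finite set of attributes but is easy to evaluate as a pattern, the following Python sketch translates a simple Prosite pattern into a regular expression and tests a sequence against it. This is only an illustration and not the BLID implementation: the helper names and the test sequences are invented, and only the pattern elements used in the text (single residues, `[..]`, `{..}`, and `x`) are handled.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple Prosite pattern into a Python regular expression."""
    parts = []
    for elem in pattern.split("-"):
        if elem == "x":
            parts.append(".")                      # any residue
        elif elem.startswith("{") and elem.endswith("}"):
            parts.append("[^" + elem[1:-1] + "]")  # any residue except the listed ones
        elif elem.startswith("[") and elem.endswith("]"):
            parts.append(elem)                     # one of the listed residues
        else:
            parts.append(re.escape(elem))          # a literal residue
    return "".join(parts)


def seq_matches(seq: str, pattern: str) -> bool:
    """True if the motif occurs anywhere in the sequence."""
    return re.search(prosite_to_regex(pattern), seq) is not None
```

For example, `prosite_to_regex("N-{P}-[ST]-{P}")` gives `N[^P][ST][^P]`, so the made-up sequence `"MANGSAK"` matches the motif while `"MANPSAK"` does not.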

Our logical language for descriptions and queries can be understood as a 
propositional logic, whose atoms are replaced by logical features belonging to 
fragments of predicate logic. In fact, this is equivalent to saying that our logic is 
a controlled fragment of predicate logic, plus some theory about the considered 
data types. For instance, the logical feature seq match N-{P}-[ST]-{P} can be
translated into predicate logic by:

$\forall Orf : \exists Start, P1, P2, A2, P3, A3, P4, A4 : seq(Orf, Start) \wedge$
$somesucc(Start, P1) \wedge aa(P1, \text{'N'}) \wedge succ(P1, P2) \wedge aa(P2, A2) \wedge$
$A2 \neq \text{'P'} \wedge succ(P2, P3) \wedge aa(P3, A3) \wedge (A3 = \text{'S'} \vee A3 = \text{'T'}) \wedge$
$succ(P3, P4) \wedge aa(P4, A4) \wedge A4 \neq \text{'P'}$,

given some theory to define the predicate somesucc as the transitive closure of 
predicate succ. It should be clear from this example that a customized logic 
is preferable to predicate logic as a query language. This is more than mere 
syntactic sugar because the use of specialized logics enables us to make the 
computation of subsumption decidable, simpler, and more efficient (remembering 
that subsumption needs to be applied between queries as well). 

Building such a logic from scratch would be a tedious task because of the 
number of different concrete domains. Moreover, this would make it difficult to 
extend, or to reuse, parts of existing customized logics. In order to favor modu- 
larity and re-usability we apply the principles of logic functors [FR02a] , which 
enable us to build complex logics by simple composition of smaller logic compo- 
nents, the logic functors. Essentially, a logic functor is a function from logics to 
logics, where logics are modeled as abstract types encapsulating both represen- 
tation and reasoning. Most logic functors are reused from previous applications, 
and a few others, specific to bioinformatics, are created (e.g., for protein
sequences and Prosite motifs). Due to limited space, we do not give in this paper
formal definitions for the language, the semantics, and the subsumption of logic
functors. For those interested, they are available for a few functors in [FR02a].

Figure 1 shows the way the logic functors are composed as a tree. Each node 
is a logic functor that is applied to the logics composed from its sub-nodes. At the 
root of the tree we recognize the propositional parts of the logic. The remainder of 
the tree describes the logic of features that replace the usual atoms. Features are 
used to represent both object descriptions and query terms. They are essentially 
conjunctions of terms taken in concrete domains. The functor Sum allows us to
easily combine any number of concrete domains, and facilitates extensibility of
the logic. Finally, the functor AIK (named after the epistemic logic All I Know)
enables us to apply the Closed World Assumption on object descriptions [FR02a].

    Prop - AIK - Conj - Sum
        Attr(name, gene, ...) - String
        Attr(nb_atoms, ...) - Interval - Int
        Attr(mol_weight, ...) - Interval - Float
        Attr(seq) - Motif - AA
        Attr(struc) - Motif - SS
        Pair - Attr(exon) - Segment - Int
        Attr()

Fig. 1. The BLID's logic represented as a tree of logic functors.

The logical context of chromosome A contains 108 objects (ORFs), from
which 4867 features are extracted. This makes an average of 161 features per
object, and 3.5 objects per feature (feature sharing). As this context is extended
to the whole genome (6141 ORFs), the number of features per object remains
constant, and the sharing increases, which results in a total number of features
of around 60,000. In such a large context, it becomes intractable to compute
the concept lattice. However, it is important to provide users with navigation as
they cannot remember by heart the function names or the Prosite motifs, and
also because, given some previous query, it is difficult to guess relevant features
to refine it. Section 4 develops an interactive and incremental way of building
such queries: logical navigation.




4 Logical Navigation 

The idea of navigation is to help users build their queries, and to enable them
to form overviews on the data. In the domain of concept analysis, navigation
is usually realized by a direct browsing of the concept lattice, or some part of
it [GMA93]. However, this lattice rapidly becomes very large, and we prefer to
realize navigation by a form of human-computer dialog [FR03], as this gives
more freedom for controlling the number of answers.

To navigate from one query/concept to another, users specify query increments
with exclamatory commands, such as ! name ends with "w", and they
get suggestions for increments with the interrogative command ?. These
suggestions are found among the features that have been automatically extracted
from object descriptions by the logical operation feat. For example, in the context
made of all ORFs of chromosome A, this command gives the following result:



[1] ?                        What is there ?
100 ! struc                  100 ORFs with known 2nd structure !
101 ! 'MIPS function'        101 ORFs with function !
108 ? name                   What kind of name ?
108 ? some exon              What kind of exon ?
108 ? seq                    What kind of sequence ?
108 ? mol_weight in ..       What kind of mol. weight ?
108 ? nb_atoms in ..         What kind of nb. of atoms ?
108 object(s)                There are 108 selected ORFs.



The system returns not only exclamatory suggestions (query increments),
but also interrogative suggestions. These can be understood as "questions as
answers to questions", and their purpose is to provide more concise answers,
as without them many exclamatory suggestions would possibly replace each
interrogative suggestion. These are called view increments, because they allow
the user to focus on one kind of features. For instance, the user can select the
command "? nb_atoms in .." in order to focus on the number of atoms:



[2] ? nb_atoms in ..         What kind of nb. of atoms ?
5 ! nb_atoms = 2****         5 ORFs with nb. of atoms in [20000, 30000[ !
22 ! nb_atoms = 1****        22 ORFs with nb. of atoms in [10000, 20000[ !
81 ! nb_atoms = 0****        81 ORFs with nb. of atoms in [0, 10000[ !
108 object(s)                There are 108 selected ORFs.



The formula nb_atoms = 2****, which means that the number of atoms lies
between 20,000 and 29,999, is a feature automatically generated by the
functor Int to make the navigation more progressive than a flat set of values.
This makes the answers look like a histogram, as values at the left of increments 
are the number of objects they would select (support). With query languages 
such as SQL or Prolog, one would either get a flat list of all ORFs along with 
their exact number of atoms, or have to ask an aggregative query for all relevant 
intervals; which are difficult to know without prior knowledge of the range and 
the scale of the attribute (which can change according to the working query) . 
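
The histogram-like answers can be illustrated with a short Python sketch that derives a masked-digit feature such as `2****` from each integer value and counts how many objects support it. This is a guess at the idea only, not the actual feature generation of the functor Int: the function names, the fixed width of five digits, and the sample values are assumptions made for the example.

```python
from collections import Counter


def digit_feature(value: int, width: int = 5) -> str:
    """Keep the leading digit of a zero-padded value and mask the remaining digits."""
    if value < 0 or len(str(value)) > width:
        raise ValueError("value does not fit into the assumed number of digits")
    digits = str(value).zfill(width)
    return digits[0] + "*" * (width - 1)


def histogram(values, width: int = 5):
    """Count the objects supporting each masked-digit feature, largest interval first."""
    counts = Counter(digit_feature(v, width) for v in values)
    return sorted(counts.items(), reverse=True)
```

With the hypothetical values `[3138, 12000, 25000, 9999]`, the value 3138 (the number of atoms of YAL003w in the example description) falls under the feature `0****`, and the result lists one object each for `2****` and `1****` and two objects for `0****`.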

Coming back to command [1], we see that the feature 'MIPS function' appears
as a query increment, because it is not supported by all objects. However,
we would expect it to be a view increment as well, focusing on the functions of
ORFs. In fact, exclamatory suggestions can often be combined with an
interrogative command.

[2] !? -l -i 'MIPS function'                     Select ORFs with function ! What kind of functions ?
22 ! 'mfc01: METABOLISM'
3 ! 'mfc02: ENERGY' -> 'mfc01: METABOLISM'
[..]
39 ! 'mfc99: UNCLASSIFIED PROTEINS'
101 object(s)                                    There are 101 selected ORFs.

The 101 objects are selected and functional classes are listed in lexicograph- 
ical order (option -l). This shows that about 40% of ORFs are unclassified. 
Option -i displays contextual implications between suggested increments. This 
enables the user to discover that every ORF in chromosome A that has an en- 
ergetic function, has also a metabolic function. 
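
A contextual implication between two increments holds when every selected object supporting the first also supports the second, i.e. when the extent of the first is included in the extent of the second. The following Python sketch shows this check on invented toy extents; the function name and the data are hypothetical and do not reproduce the BLID system.

```python
def contextual_implications(extents):
    """Return all pairs (f, g), f != g, such that ext(f) is a subset of ext(g)."""
    return sorted(
        (f, g)
        for f, ef in extents.items()
        for g, eg in extents.items()
        if f != g and ef <= eg
    )


# Toy extents of three suggested increments within the current selection.
extents = {
    "ENERGY": {"orf1", "orf2"},
    "METABOLISM": {"orf1", "orf2", "orf3"},
    "UNCLASSIFIED": {"orf4"},
}
```

Here the only implication found is from `ENERGY` to `METABOLISM`, mirroring the kind of arrow shown in the session output above. Note that an increment with an empty extent would trivially imply every other increment.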

5 Related Work 

Our logics are similar to Description Logics (DL, [Bra79]), in the sense that for- 
mulas are variable-free, and their semantics is based on object sets rather than 
on truth values. Our attributes are equivalent to functional roles, and our oper- 
ator some corresponds to the existential quantification. The two key differences 
are the modularity of logic functors, and our focus on concrete domains. Further- 
more, it would be possible to define a logic functor implementing a description 
logic in which atoms could be replaced by formulas of concrete domains; as it has 
been done with propositional logic. Both DL and our logics could be translated 
into predicate logic, which is more expressive; however they are more readable, 
and allow for logical navigation thanks to their compatibility with Logical Con- 
cept Analysis. We are also developing in parallel a querying interface in predicate 
logic (using Prolog) to offer more expressive power to expert users, but at the 
cost that no navigation is provided. 

A project related to ours is GIMS [Cor01], which aims at providing querying
and analysis facilities over a genome database. In this project, simple queries can 
be built incrementally by selecting attributes and predefined value patterns in 
menus. Canned queries are made available for more complex queries and analysis. 
We differ in that we have made the choice to give users an open language, 
knowing that navigation will be available to guide users; even if they have no 
prior knowledge. It is difficult, if not impossible, to forecast all types of queries 
that may be of interest in the future. 

6 Future Work 

Our future work concentrates on providing analysis facilities in addition to query- 
ing and navigation. The kind of analysis we are mostly interested in is to dis- 
cover by machine learning techniques rules that predict the biological functions 
of ORFs from genomic data (i.e., functional genomics [KKCD00]). A first step 
will be to integrate existing machine learning techniques in BLID. Propositional 
learners (e.g., C4.5, concept analysis [GK00]) expect a kind of attribute context,
which can easily be extracted from BLID’s logical context by making each
feature (e.g., sequence motifs) a Boolean attribute. Inductive Logic Programming
(ILP, [MR94]) expects a representation of examples in predicate logic, which can 
always be obtained by translating them from our specialized logics. 
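
The extraction of an attribute context for propositional learners can be sketched as follows. This is a minimal illustration in Python, assuming each object is described by a set of feature strings; the function name and the example ORF descriptions are hypothetical and do not come from BLID.

```python
def to_attribute_context(descriptions):
    """Turn {object: set of features} into a Boolean object-by-attribute table."""
    # Each distinct feature becomes one Boolean attribute (one column).
    attributes = sorted(set().union(*descriptions.values()))
    table = {
        obj: {attr: attr in feats for attr in attributes}
        for obj, feats in descriptions.items()
    }
    return attributes, table


# Hypothetical ORF descriptions, invented for illustration.
orfs = {
    "orf_a": {"motif:ATG", "chromosome:A"},
    "orf_b": {"chromosome:A"},
}
attributes, table = to_attribute_context(orfs)
```

The resulting table has one row per object and one Boolean column per feature, which is the input shape that attribute-based learners typically expect.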

Ultimately, BLID could be made an Inductive Database [dR02] by unifying 
various machine learning and data-mining techniques under a unified inductive 
query language. For instance, such a language could allow one to ask for “all most
general rules predicting whether a protein is involved in metabolism according
to its sequence”. Such a high-level language would be very helpful to biologists
and bioinformaticians, who strive to relate genomic data to biological functions. 

A LIS executable for Unix/Linux can be freely downloaded at
http://users.aber.ac.uk/sbf/camelis.

Abstract. In this talk I shall relate some of my experiences in teaching
lattice theory and the theory of ordered sets to undergraduates since 
1975. I will show how Formal Concept Analysis can be used as a unifying 
and motivating example in many parts of the theory. 



1 The Way It Was 

I have taught lattice theory to undergraduates at La Trobe University since
1975. I have also had the opportunity to present one-semester courses at Oxford
University and Monash University. Over a period of 13 years, I refined my notes
and the exercise sets. Each year, new exam questions were incorporated into the
exercises. The wording of exercises that caused confusion amongst students
was altered and, where necessary, appropriate hints were added. In this way I
produced a set of notes which provided an excellent basis for a course pitched 
at typical second or third year mathematics students. In 1984, Hilary Priestley 
came to La Trobe to work with me and took a copy of my “Lattice Theory” notes
back to Oxford with her. She modified them by including more information on 
Boolean algebras and used them for a course she was teaching to undergraduates 
in theoretical computer science and mathematics. Meanwhile, I continued to 
expand my version. In 1987 we proposed to Cambridge University Press that we 
combine our two sets of notes into a text book. Hilary returned to La Trobe to 
continue our research in 1988. During this visit we agreed on the overall structure 
of the text. On her return to Oxford, writing of the first edition of Introduction 
to lattices and order [1] began in earnest. 

The structure of the first edition was strongly influenced by the request from 
our editor, David Tranah, that we make the book as attractive as possible to 
computer scientists: we wanted to write a text book but he wanted to sell one! 
We spent a lot of time working through unpublished notes and manuscripts on 
computer science. As we remarked in the preface, “... course notes by Dana
Scott, Samson Abramsky and Bill Roscoe enticed us into previously unfamiliar
territory.” David Tranah also asked that the computer science material be
as early as possible in the text and as a result, Chapters 3 and 4, on CPOs and 
Fixpoint theorems, were added. 

In order to keep the price within the reach of our students, we set ourselves 
a strict page limit of 256 pages. Even with the inclusion of the new material in 

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 55-56, 2004. 

(c) Springer-Verlag Berlin Heidelberg 2004

Chapters 3 and 4, this allowed us a closing 16-page chapter. We made a decision
that was considered by some of our colleagues to be rather radical. As our final 
chapter, we included an introduction to the basics of Formal Concept Analysis, 
a topic, if not in its infancy, then in its early adolescence. This decision has been 
more than vindicated over the years. 



2 The Way It Is 

Between 1990 and 2000, both authors continued to teach subjects based on the 
text. The chapter on FCA was very soon promoted by both of us (independently) 
to as early in the semester as possible. Our typical undergraduate courses con- 
sisted, in order, of Chapter 1, half of Chapter 2, Chapter 11 (the FCA), followed 
by Chapters 5 to 8. The remaining chapters (3, 4, 9 and 10) consist of more ad- 
vanced material taught in the fourth year to students doing an honours degree. 

By 2000, the fourth printing of the first edition had sold out and CUP re- 
quested a second edition. Based on our experience of teaching from the text 
between 1990 and 2000, we decided to reorder the chapters of the text and to 
present them in the order that we taught them. So we find in the second edition 
of Introduction to lattices and order [2], published in 2002, that Formal Con- 
cept Analysis has been promoted from Chapter 11 to Chapter 3. This has the 
distinct advantage that topics introduced earlier, such as join-irreducibility and 
join-density, see immediate applications in the Fundamental Theorem of Concept 
Lattices and other topics introduced in later chapters, such as Galois connec- 
tions and the representation of finite distributive lattices, can be motivated by 
the results on concept lattices. The chapter on FCA in the second edition also 
includes a natural algorithm for finding all concepts of a context. While there is 
software that will do this for us, we believe that it is important that students get 
their hands dirty with some small examples before handing over responsibility 
to the technology. 
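
For readers who want to try a small example by machine as well as by hand, the following Python sketch enumerates all concepts of a small context. It relies on the standard fact that every extent is an intersection of attribute extents; it is written in that spirit and is not claimed to reproduce the algorithm as presented in the text. The function name and the three-object context are assumptions made for illustration.

```python
def all_concepts(context):
    """context: {object: set of attributes}.
    Return all formal concepts as (extent, intent) pairs of frozensets."""
    objects = set(context)
    attributes = set().union(*context.values())

    def extent_of(attrs):
        # Objects that have every attribute in attrs.
        return frozenset(g for g in objects if attrs <= context[g])

    def intent_of(objs):
        # Attributes shared by every object in objs (all attributes if objs is empty).
        if not objs:
            return frozenset(attributes)
        return frozenset(set.intersection(*(set(context[g]) for g in objs)))

    # Start with the full object set and close under intersection
    # with each attribute extent in turn.
    extents = {frozenset(objects)}
    for m in attributes:
        m_extent = extent_of({m})
        extents |= {e & m_extent for e in extents}

    return sorted(
        ((e, intent_of(e)) for e in extents),
        key=lambda c: (len(c[0]), sorted(c[0])),
    )


# Hypothetical context with three objects and three attributes.
context = {"g1": {"a", "b"}, "g2": {"b", "c"}, "g3": {"b"}}
concepts = all_concepts(context)
```

This context has four concepts: the bottom concept with empty extent, the two object concepts of g1 and g2, and the top concept whose intent is the single shared attribute b.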

In this talk I will give a number of examples that show how Formal Concept 
Analysis both reinforces and leads naturally to other important topics within 
lattice theory. 

Abstract. Mail-Sleuth is a personal productivity tool that allows individuals
to manage email and visualize its contents using line diagrams.
Based on earlier work on the Conceptual Email Manager (Cem), a major 
hypothesis of Mail-Sleuth is that novices to Formal Concept Analysis 
can read a lattice diagram. Since there is no empirical evidence for this
in the Formal Concept Analysis literature, this paper is a first attempt
to test this hypothesis by following a user-centred design and evaluation 
process. Our results suggest that, with some adjustments, novice users 
can read line diagrams without specialized background in mathemat- 
ics or computer science. This paper describes the process and outcomes 
based on usability testing and explains the evolution of the Mail-Sleuth 
design responding to the evaluation at the Access Testing Centre. 



1 Introduction 

Mixed initiative [15] is a process in human-computer interaction involving hu- 
mans and machines sharing tasks best suited to their individual abilities. In 
short, the computer performs computationally intensive tasks and prompts 
human-clients to intervene when the machine is unsuited or resource limitations 
demand human intervention. This process is well-suited to document browsing 
using Formal Concept Analysis (FCA) and has been demonstrated in previous 
work in the Conceptual Email Manager (Cem) [5,6,7] and Rental-Fca [8] 1.

Mail-Sleuth 2, shown in Fig. 1, follows these ideas by re-using the interaction
paradigm of the Cem embedded within the Microsoft Outlook email client.
Other related work demonstrates mixed initiative using line diagram animation, 

1 http://www.kvocentral.org/software/rentalfca.html

2 http://www.mail-sleuth.com

Fig. 1. The final “look” of Mail-Sleuth. The line diagram is highly stylized and
interactive. Folders “lift” from the view surface and visual clues (red and blue arrows) 
suggest the queries that can be performed on vertices. Layer colors and other visual 
features are configurable. Unrealized vertices are not drawn and “Derived” Virtual 
Folders are differentiated from Named Virtual Folders. A high level of integration with 
the Folder List to the left and the Folder Manager (see tab) is intended to promote a 
single-user Conceptual Information System task flow using small diagrams. Nested-line 
diagrams are not supported; however, it is possible to zoom into object sets at vertices
with a similar effect. 



notably the algorithms in Cernato [1] 3. Like Cernato, Mail-Sleuth does
not employ nested-line diagrams [27,24], instead relying on mixed initiative to
reduce line diagram complexity. The client is able to determine trade-offs between
attributes and alter search constraints to locate objects that satisfy an
information requirement. Because nested-line diagrams are not employed, a ma- 
jor issue is managing the complexity of line diagrams via iterative visualization 
and zooming. Therefore, keeping the diagram simple needs to be encouraged 
by the interface. Further, little or no evidence was available in the literature of 
FCA to support the view that novice individuals could read and interpret line 
diagrams without specialized training. It was widely assumed that difficulties 
resulting from novices using a tool like Mail-Sleuth would inevitably result. 
This assumption first needed to be tested and, second, adjustments made in
the event that usability problems arose. This paper follows a user-centred test



3 Cernato is commercial software developed by Navicon AG. 
Fig. 2. A Conceptual Information System (CIS) and its roles (diagram used with
permission of Becker, 2003). Note that a CIS has three roles: the Conceptual Systems
Engineer designs scales together with the Domain Expert based on theory or
practice. The System User may or may not be the same person as the Domain Expert. 



methodology [17], reports its outcomes and the way in which testing conditioned 
the design of Mail-Sleuth and the visualization of line diagrams. 

This paper is structured as follows. Section 2 surveys computer-based FCA 
software systems. A common thread among both commercial and open-source 
FCA tools is the use of a lattice diagram to visualize information content. Mail- 
Sleuth is situated within this software tools survey. Section 3 covers the back- 
ground to information landscapes and conceptual knowledge processing. Section 
4 describes the evolution of Mail-Sleuth and Section 5 deals with specific
evidence that conditioned its design. 



2 Tools for FCA 

There are two dimensions of software tools using FCA: commercial
versus open-source and general-purpose versus application-specific.

The longest surviving general purpose platform for FCA is the Glad sys- 
tem [9] which is a general framework for finite lattices, not restricted to FCA. 
TOSCANA, developed over many years by various members of the Research Group 
Concept Analysis ( fz°bw ) in Darmstadt, is better known and specifically tar- 
geted to FCA. Toscana-systems, referring to outcomes from the TOSCANA soft- 
ware framework, are based on a four-step task flow that includes establishing 
conceptual scales, data capture, schema browsing and human interpretation. In 
the usual configuration of Toscana-systems, a program called Anaconda serves 
as the conceptual system editor (to define scales), TOSCANA is then the concep- 
tual system browser and data is stored in Microsoft Access. In Toscana-systems 
there is usually a separation of roles from the individual creating the scales and 
the end user of the system. The task flow is often called a “conceptual informa- 
tion system” [14], its roles and participants illustrated in Fig. 2. 
Modifications to the TOSCANA program have demonstrated that it can be
purposed toward specific application problems [10]. In particular, Groh [13]
adapted TOSCANA V3.0 to demonstrate the integration of Prediger and Wille’s 
[18] Relational Power Context Families in order to represent and process con- 
cept graphs. However, during this work it became apparent that some of the 
software libraries on which TOSCANA V3.0 was based, namely embedded graph- 
ics libraries from the Borland C++ IDE, would make it difficult for the program 
to migrate to other operating environments. 

In 2000, the GoDa project was established as a collaboration between the 
Knowledge, Visualization and Ordering Laboratory (KVO) in Australia and the 
fz°bw in Darmstadt with the vision for a Framework of Conceptual Knowl- 
edge Processing. The collaboration produced many outputs, one of which is 
the Tockit 4 open-source initiative of which ToscanaJ 5 forms an integral element.
ToscanaJ is a platform-independent (Java-based) re-implementation
of TOSCANA v3.0 that supports nested-line diagrams, zooming and filtering.
ToscanaJ follows the conceptual information systems task flow (shown in Fig.
2) with Anaconda being replaced by two programs, Elba and Siena 6. Elba
and Siena are similar but differ in emphasis: one is a database schema editor,
the other edits memory-bound schemas. ToscanaJ can talk to any RDBMS
via ODBC/JDBC or via an embedded RDBMS. Line diagrams of concept
lattices can be exported in multiple formats, color is widely used and
ToscanaJ has more flexible data display features (allowing more varied numerical
data analysis and presentations) than TOSCANA v3.0. ToscanaJ can
import legacy file formats from its DOS and Windows-based predecessors, ConImp
[2], Toscana and Cernato, as well as the XML-based conceptual schema
format (.CSX).

ToscanaJ is not the only general multi-platform tool for formal concept 
analysis to emerge in the open-source era. ConExp 7 is another Java-based open- 
source project that combines context creation and visualization into a single task 
flow software tool. GaLicia [23] is another Java-based research software program
(albeit at an earlier development stage than ToscanaJ and ConExp) with
particular emphasis on experimentation with lattice closure and visualization 
algorithms. 

Like Groh’s adaptation of TOSCANA for concept graphs, ToscanaJ’s source
code has been adapted to various application contexts. Two of these, DOCCO and
Tupleware, form part of the Tockit framework (found at http://tockit.sf.net).
Tilley [21] has also adapted the ToscanaJ code in his SpecTrE transformation 
engine for formal specifications in software engineering. 

Prior to 2000, international collaboration in FCA was less organized and 
open-source software projects less popular than today. Warp-9 FCA [4] was a 
first attempt at document retrieval based on a faceted hierarchy re-used from a 



4 http://tockit.sf.net 

5 http://toscanaj.sf.net 

6 The collaboration could not obtain permission to use the name Anaconda J. 

7 http://conexp.sf.net

medical ontology and mixed initiative. These ideas were refined and applied to
email in Cem [5,6,7] and for the Web in Rental-Fca [8] and more recently in 
the commercial email management program, Mail-Sleuth. Warp9, Cem and 
Mail-Sleuth owe their origins to earlier information retrieval tools developed 
by Carpineto and Romano [3]. Further, this work builds on the idea of FCA 
for document browsing by Godin and Missaoui [12]. Other work that follows this
literature thread includes Kim and Compton [16], Rock and Wille [20] and Qian 
& Feijs [19]. Rapid iteration, direct manipulation to reduce display complexity 
and the use of conceptual scaling to aid scalability are hallmarks of the later 
work on document browsing and information retrieval using FCA but there are 
no existing studies that test the viability of novice users reading and interpreting 
line diagrams and therefore no indication of the benefit of the work. 

3 Information Visualization and FCA 

A main attraction of FCA has been its visual utility both for general purpose 
Conceptual Information Systems frameworks, characterized by Toscana-systems, 
and also for specialized tools for information retrieval and software engineering. 
Software engineering has been a strong application area for techniques in FCA 
and is thoroughly surveyed by Tilley [22,21]. The emphasis on information visualization
follows in a natural way from Wille’s vision of “landscapes of knowledge”
which helps define conceptual knowledge processing. 

“The name TOSCANA (= Tools of Concept Analysis) was chosen to 
indicate that this management system allows us to implement conceptual 
landscapes of knowledge. In choosing just this name, the main reason was
that Tuscany (Italian: Toscana) is viewed as the prototype of a cultural
landscape which stimulated many important innovations and discoveries, 
and is rich in its diversity ...” [26]. 

Despite the attraction of line diagrams to those of us within the field, it is 
apparent that the uninitiated have had difficulties interpreting a line diagram as 
an information space. The conventions for reading line diagrams are manifest in 
the earliest literature on FCA and these are (in large part) related to lattices being
drawn on paper (or on a blackboard using chalk). It is difficult from within the 
field to understand the difficulties faced by novice users or break tradition to 
develop new conventions for drawing line diagrams. Even the use of color in 
ToscanaJ attracts critique from FCA-purists but needs to be situated in the
context of a move away from a paper-based approach to Conceptual Information 
Systems to a more screen-based mixed-initiative interactive approach. 

In the context of the design of a commercial tool like Mail-Sleuth it is 
possible to break with tradition and invent (or re-invent) metaphors more suit- 
able to individuals without specialist training in FCA. This process follows a 
form of user-centered design [17]. The usability tests at Access Testing Centre
(ATC) 8 identify requirements and condition the software design to make line diagrams
more easily understood as an information space by novice users.



8 http://www.testingcentre.com.au

4 Usability Evaluation

4.1 Comparative Functionality Review 

The comparative functionality review was conducted by two ATC analysts performing
self-determined exploratory testing. The evaluation had a significant
comparative component, comparing the ease of executing various key functions in Mail-Sleuth
against competitor applications.

The May 2003 version of Mail-Sleuth had no initial virtual folders when
the program was first installed and the user had to go through a folder config- 
uration process from scratch. In FCA terms, Mail-Sleuth had no pre-defined 
conceptual scales. While this may suit advanced users, or users expecting the 
program to be a general purpose framework for document browsing using FCA, 
the majority of users are likely to encounter difficulties with this. ATC recom- 
mended a number of useful pre-defined Virtual Folders be employed such as 
“This Week” , and “Attachment” folders for email attachments of different doc- 
ument types and sizes. These form the basis of the Folder List shown to the 
left of Fig. 1. This recommendation was followed and in subsequent versions 
pre-defined Virtual Folders were added, including the folders mentioned above
(various popular document and image attachment types and sizes) and also a 
“follow-up” folder which tests the Outlook follow-up flag. These serve as exam- 
ples and a useful starting point from which users can extend the Virtual Folder 
structure (scale) while benefiting immediately from the software. Other compa- 
rable products derive Virtual Folders from reading the mailbox but the structure 
(once built) cannot be modified or extended as with Mail-Sleuth. This advan- 
tage is highlighted by including an extensible pre-defined folder structure when 
the Mail-Sleuth program is first installed. 

At the same time that pre-defined Virtual Folders were added, the idea of “User
Judgments” (reported in [7]) was eliminated. User Judgments allow the user
to over-ride the automatic classification specified in the attached Query of the 
Virtual Folder. Emails could be drag-and-dropped from regular folders into (or 
out of) existing Virtual Folders. Access Testing Centre (ATC) found this to be a 
powerful (and surprising) feature but one that would appeal only to expert users. 
While the code base for User Judgments still exists under the Mail-Sleuth 
hood it is not presently activated by the interface. 
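
The relationship between a Virtual Folder, its attached Query and User Judgments can be pictured with a small data structure. The Python sketch below is illustrative only and makes no claim about the Mail-Sleuth code base; the class, field and key names (`VirtualFolder`, `forced_in`, `forced_out`, `has_attachment`) are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class VirtualFolder:
    """A folder whose membership is computed by a standing query,
    optionally overridden per email by explicit user judgments."""
    name: str
    query: Callable[[dict], bool]
    forced_in: set = field(default_factory=set)    # emails dragged into the folder
    forced_out: set = field(default_factory=set)   # emails dragged out of the folder

    def contains(self, mail_id, mail):
        # User judgments take precedence over the automatic classification.
        if mail_id in self.forced_out:
            return False
        if mail_id in self.forced_in:
            return True
        return self.query(mail)


# Hypothetical mailbox, invented for illustration.
mails = {
    1: {"has_attachment": True},
    2: {"has_attachment": False},
    3: {"has_attachment": True},
}
folder = VirtualFolder("Attachment", lambda mail: mail.get("has_attachment", False))
folder.forced_in.add(2)    # the user drops email 2 into the folder
folder.forced_out.add(3)   # the user removes email 3 from the folder
members = sorted(m for m, mail in mails.items() if folder.contains(m, mail))
```

With both override sets left empty, membership is decided by the query alone, which corresponds to the behaviour exposed by the interface once User Judgments were deactivated.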



4.2 User-Based Evaluation 

The user-based evaluation involved one-on-one interviews and was intended to 
evaluate the ease of use and expectations of the user community. Six users were 
drawn from the core target demographic. There was a balance of male and female 
degree qualified individuals who had expressed an interest in new techniques
to categorize and handle their email. Ages spread from 25 to 50, with at least one
under 30 and at least one over 40. Included in the group were a Librarian, an
Insurance Manager, a Financial Analyst, a Recruitment Manager, an Imaging 
Specialist and a Personal Assistant. Later, informal tests carried out according 
Fig. 3. The red (=extent) and blue (=contingent) arrows “pop up” on rollover of the
envelope vertex. The extent and contingent sizes are indicated next to the envelope. The
corresponding white numerals on a black background are underlined on rollover with
the appropriate arrow.



to the ATC script included a Property Development Manager and a Graduate 
Software Engineer. Each user session lasted at most 90 minutes and was directed 
by a usability analyst who observed tasks and recorded relevant data. Each ses- 
sion was then analyzed to identify any usability issues and compile quantitative 
measures. 

4.3 Findings and Actions 

The majority of participants were able to learn the basic operations associated 
with Mail-Sleuth and complete a small number of pre-defined tasks. With 
a simple orientation script (in the place of a help system, incomplete at that 
point), participants could quickly learn to use the software. For example, once 
introduced to the concepts of Virtual Folders and how they are associated with a 
Query (or Queries), participants were able to use the application to create their 
own folders and populate them with appropriate queries. Participants indicated
they found the interface reasonably intuitive and easy to use. 

“An encouraging finding was that participants were able to read the lattice 
diagrams without prompting. Subject six even used the word lattice with- 
out it having been mentioned to her. Participants correctly interpreted 
the major elements - for example, how the ’envelope’ icons related to the
mail folders and how derived vertices represented the intersection of two 
folders”. (ATC Final Report, Usability Analysis, September 2003) 

There were still a number of improvements that could be made to the visu- 
alization map, in order to present the lattice more clearly: 

— The start and end nodes could be removed from the legend and blue and red 
arrows could be added. 

The introduction of the red and blue arrows into the lattice diagram is
intended to highlight the interactive nature of the lattice diagram as a tool 
for querying emails. This compensates the interface for the fact that only 
named folders can be accessed via the Folder List. The red and blue arrows 
are clues from the line diagram that the extent and contingent are available 
and that Derived Folders can be created by manipulating the diagram. 

— For more complicated structures, less emphasis could be placed on regions 
that are essentially ’unmatched’. This would reduce visual clutter and further 
highlight the relationships that do exist. 

This comment resulted in the complete elimination of unrealized
vertices from the line diagram. Many of the test subjects expressed this idea
in the usability script. The reduced line diagram was
included as an option for advanced users.

— The format for representing total and dispersed emails associated with each 
folder could be more clearly represented - some users indicated that the 
present format (using brackets) represented total and ’unread’ e-mails. A 
reference to the format could be included in the legend. 

Tying together the textual representation of extent and contingent to the 
red and blue arrows (as shown in Fig. 3) resulted in the total (extent) and
dispersed (contingent) sizes being represented as a fraction. 

— The initial/default node view could be improved - when elements are close 
their labels can overlap. An interesting finding was that some users found 
more complicated diagrammatic representations better conveyed the relation- 
ships to the left-hand folder list. 

The ability to adjust the highlights and font sizes for diagram labels was 
included (along with the ability to color the layered highlights). The ob- 
servation that more complex line diagrams more strongly linked the line 
diagram to the Folder List is because a larger line diagram contains more 
labels appearing in the Folder List. Thus, the correspondence from line dia- 
gram to Folder List is more easily made when there are a larger number of 
intersecting elements. 

Finally, user responses in this small demographic give encouraging indications 
of an implicit understanding of information visualization using line diagrams. 
When shown a very large line diagram our librarian found it overwhelming but 
was certain that there was “value in a lattice of the information space”. More 
specifically, one user said that she preferred a reduced line diagram, namely she 
saw “no reason that points without corresponding data should be drawn at all” . 

When asked what they liked most about the application, users responded
with statements such as: “Defined searches - better time management. Ability
to separate text from e-mails, program creates folders for you”. We interpret this
to mean that this user understands that a permanent standing Query is created
and attached to a Virtual Folder. The term “Virtual Folder” was also used by another
respondent when asked the same question: “Drilling down through virtual folders
to locate specific emails etc.” This indicates a familiarity with the idea of a
“Virtual Folder”, either pre-existing or learned during the 30-40 minutes using
Table 1. Participants were presented with a number of statements and they were asked
to select a rating. The range of the ratings went from -2 to +2, to indicate the extent
to which they agreed with the statement. Here -2 = ’Definitely No’, 0 = ’Uncertain’
and +2 = ’Definitely Yes’.

No.  Statement                                              Ave. Resp.
1    Clear how the application is to be used                1.3
2    The interface was simple to use                        0.8
3    The application appears to be a useful tool            1.8
4    I liked the layout of the pages                        1.2
5    I found the icons intuitive                            0.5
6    I found the Quick Search feature was useful            1.0
7    I found the folder view intuitive                      1.3
8    I found the diagrammatic view intuitive                0.8
9    Clear relationship, folder view to diagrammatic view   0.7
10   The configuration functionality was useful             0.8
11   I would use this application                           1.7
12   I will recommend this application to others            1.7

the program. Further, the use of the term “drilling down” in the appropriate 
context of data mining and visualization suggests an encouraging level of comfort
among the target user group with the terminology of the program. 

Table 1 shows that the user group could use Mail-Sleuth and had a clear
understanding of its utility. While questions 8 & 9, which relate to visualiza- 
tion of line diagrams, scored relatively poorly compared to other questions, it 
is apparent that the results are nonetheless positive and doubtful that other 
question groups would have been so highly scored if the line diagrams had not 
been understood. Nonetheless, improvements to the visualization aspects of the 
program did result, mostly on the basis of the users’ written comments, and
these are described in Section 5. 

Considerable time is spent in the development process responding to nega- 
tive comments by users during software evaluations. Negative comments were 
solicited when the group were asked “what they liked least” about the Mail- 
Sleuth application. Responses included: “it takes a few moments to understand 
the 3-D concept as most people are used to a flat & hierarchical folder layout”. 
The response to this has been to include a careful introductory tutorial/help 
system to explain Virtual Folders and Structures and introduce specific “simpli- 
fied” terminology to facilitate an understanding of Mail-Sleuth. It is notewor- 
thy that the Virtual Folder idea also appears as one of the features that people 
liked most. The comment that the “diagram is a bit overwhelming and has badly 
chosen colors” was addressed by giving people the option of choosing their own 
color schemes and font sizes and trying to simplify the line diagram as described 
in the next section. 

5 Design Aids for Interpreting Line Diagrams 

During the comparative review of Mail-Sleuth in May 2003 a comment was
made by co-author Peter Brawn that “the drawing conventions for a lattice
diagram were no different from a graph in Mathematics. What makes this a
lattice diagram and not a graph? How do I know that I should read this top to 
bottom?” 

A line diagram (or concept lattice) is a specialized Hasse diagram with several 
notational extensions. Line diagrams contain vertices and edges with the vertices 
often labeled dually with the intent (above) and extent (below). Rather than 
labeling each node in the line diagram with its intent and extent, a reduced
labeling scheme can be used and each object (and attribute) appears only once.
In many Toscana-systems (and in Cem) a listing of the extent is often replaced 
with a number representing the cardinality of the extent (and/or the contingent). 
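
The two cardinalities mentioned above can be computed directly from a context. In the Python sketch below, the extent of a concept is the set of objects having every attribute of its intent, and the contingent is taken to be the set of objects whose own attribute set equals the intent exactly (the objects that would be labeled at that vertex under reduced labeling). The function name and the four-email context are assumptions made for illustration.

```python
def label_counts(context, concept_intents):
    """context: {object: set of attributes}.
    Return {intent: (extent size, contingent size)} for the given intents."""
    counts = {}
    for intent in concept_intents:
        # Extent: objects carrying every attribute of the intent.
        extent = {g for g, attrs in context.items() if intent <= attrs}
        # Contingent: objects whose attribute set is exactly the intent.
        contingent = {g for g in extent if context[g] == intent}
        counts[intent] = (len(extent), len(contingent))
    return counts


# Hypothetical emails classified by two Virtual Folders.
emails = {
    "mail1": {"This Week"},
    "mail2": {"This Week", "Attachment"},
    "mail3": {"Attachment"},
    "mail4": {"This Week", "Attachment"},
}
intents = [
    frozenset(),
    frozenset({"This Week"}),
    frozenset({"Attachment"}),
    frozenset({"This Week", "Attachment"}),
]
counts = label_counts(emails, intents)
```

In this example the top vertex has an extent of four emails but an empty contingent, while the vertex for both folders has an extent and a contingent of two.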

In Hasse diagrams, edges (representing the cover relation) are unlabeled. It 
is well understood in Mathematics that the order relation of an ordered set is 
transitive, reflexive and antisymmetric. To simplify the drawing of an ordered 
set (via its cover relation) the reflexive and transitive edges are removed, and 
the directional arrows of the relation are dropped. It is therefore meant to be 
“understood” that the Hasse diagram is hierarchical with the edges pointing 
upward. In other words, if x < y in the poset then x appears at a lower point 
than y in the diagram. 
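
The reduction from an order relation to the edges of a Hasse diagram can be sketched in a few lines of Python. This is only an illustration and not code from Mail-Sleuth: the function name cover_relation and the divisibility example are assumptions made for the sketch. It keeps the pairs x < y that have no element strictly between them, which are the edges that remain once reflexive and transitive edges are removed.

```python
def cover_relation(elements, leq):
    """Return the cover relation of a finite ordered set.

    `leq(a, b)` is assumed to be a reflexive, antisymmetric, transitive test.
    A pair (x, y) is kept when x < y and no z satisfies x < z < y, i.e. the
    reflexive and transitive edges are dropped.
    """
    def lt(a, b):
        return a != b and leq(a, b)

    return {
        (x, y)
        for x in elements
        for y in elements
        if lt(x, y) and not any(lt(x, z) and lt(z, y) for z in elements)
    }


# Hypothetical example: the divisors of 12 ordered by divisibility.
divisors = [1, 2, 3, 4, 6, 12]
covers = cover_relation(divisors, lambda a, b: b % a == 0)
# Each pair (x, y) would be drawn as an unlabeled edge with x placed below y.
print(sorted(covers))
```

The pair (1, 12) belongs to the order but not to the cover relation, so no edge is drawn for it; the reader recovers it by following edges upward.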

“The highlighting of adjoining lines is meant to illustrate relationships 
within the lattice and this could be clearer. There is a hierarchy within the 
lattice, which could be reinforced through the use of arrows on connecting 
lines that appear upon rollover..” (ATC Functional Testing Report, May 
2003) 

Access Testing Centre (ATC) suggested arrowheads be used in the line 
diagram to reinforce its hierarchical character. This represents an unacceptable 
violation of a convention dating back (at least) to Helmut Hasse’s 1926 book 
Höhere Algebra, so some other mechanism to reinforce hierarchy without 
tampering with the edge notation in the line diagram had to be found. 

To suggest structure, the idea of a layered line diagram was introduced. The 
principle is progressive shading of the background: the top layer is darkest and 
each successive level is lighter, down to the lightest at the bottom. This is shown 
in Fig. 4. The top and bottom elements of the lattice have also been replaced with 
special icons indicating “All Mail” and “No Mail” (when the bottom element is 
the empty set of objects). In combination, layering and icon shapes are intended 
to suggest the top-to-bottom reading of the line diagram. 

Shading does not interfere with the conventions of drawing line diagrams 
because it operates as a backdrop to the line diagram. It can also be turned 
off if the line diagram is to be embedded in a printed document. However, the 
interaction of the layout algorithm and background layering fails (background 
layers are not aligned) in line diagrams with high dimensionality as shown in Fig. 
5 (left) requiring human intervention to produce something readable as shown 
in Fig. 5 (right). It is possible to use the alignment of the background layers 
to guide the manual layout process. Nonetheless, once layering was used, it was 
apparent from test subjects that they were (without prompting) able to explain 
(and read) the line diagram from top-to-bottom and bottom-to-top. 

Fig. 4. A line diagram from the August 2003 version of Mail-Sleuth. Layering is 
evident to suggest a hierarchical reading. Top and bottom elements have been specially 
iconified as arrowheads. Unrealized vertices are differentiated. Realized vertices 
are split into two iconic categories: “Named Folders”, with an intent label, shown as a 
white envelope, and “Derived Folders”, whose intent needs to be “derived”, shown as an 
orange envelope. Cardinality labels have been replaced with dual labels for “extent 
(contingent)”. Users complained that the help system was hard to activate or they 
couldn’t find it and did not recognize the “?” icon as being related to “help”! Note 
the inclusion of a Quick Search bar at the top which provides a starting point for search. 



“It was observed that most nodes in the lattice are depicted using the ex- 
act same icon, even though there are a variety of nodes. In particular, the 
root node, which represents the set of all emails, should be differentiated 
from all other nodes.. ” (ATC Report, May 2003) 

In Mail-Sleuth (and in Cem) the top of the lattice represents all emails 
in the collection. Some of the vertices shown in the line diagram correspond 
with actual Virtual Folders that exist in the Folder List to the left, while other 
vertices represent derivations of the named Virtual Folders. It is useful to in- 
dicate, through different icon types, which vertices are named Virtual Folders 
(appearing in the Folder List), and which are derived. This led to the idea of a 
“Derived Folder” , a type of Virtual Folder that does not appear in the Folder 
List and whose name is “derived” from the named Virtual Folders (attribute 
names) above it in the line diagram. 

The number of e-mails represented in each node could also be more clearly 
illustrated. For example, where totals for vertices and intersections are concerned, 
two numbers could be displayed corresponding to the extent and contingent size 
in the form extent_size (contingent_size), as in Fig. 4. 

Fig. 5. The interaction of the line diagram layout algorithm and background layering 
can produce an odd effect (left) but with human intervention layering can also be used 
as a guide to adjust the line diagram by hand (right) by moving vertices to align the 
layers. Note that the buttons in Fig. 4 have been replaced with tabs and that the help 
system is more consistently located to the top right. The Quick Search bar is visually 
highlighted and placed toward the bottom of the screen for greater emphasis. 

When drawing “reduced line diagrams”, vertices which are unrealized are 
excluded, but automatic layout can be problematic. Because Mail-Sleuth was 
designed for the non-expert, it was decided early to compromise to ensure that 
the lattice diagram was always “readable” as a default layout, and reduced line 
diagrams are not used as a default. Where elements of a scale are unrealized, the 
entire label is excluded from the diagram; however, what remains is drawn as 
a Boolean lattice with an option for a reduced line diagram. This means that 
certain combinations of realized scale elements may themselves be unrealized. 
Convention dictated that these be displayed as a vertex in the lattice somehow 
distinguishable from realized vertices (or not at all). In Fig. 6 unrealized 
vertices are the same shape and size as realized vertices, the only difference being 
the presence or otherwise of an envelope icon within the vertex. To distinguish 
unrealized from realized vertices they were reduced in size as shown in Fig. 4. 
Top and bottom vertices (when the bottom was an empty set of objects) were 
also iconified. In addition, realized vertices are identified in two ways. The first 
is where the intent label matches a “Named Folder” in the Folder List of Outlook 
(to the left of Fig. 1). The second is where the vertex represents the intent labels 
of its upper covers; these may have common attribute names (Named Folder 
names) and are colored orange. To avoid cluttering the diagram with labels on 
all vertices, the interface gives scope to query an orange envelope, and the result is 
a new Virtual Folder named after the intent labels of its upper covers appearing 
in the “Mail-Sleuth search results” in the Folder List. 

Fig. 6. A line diagram from the May 2003 version of Mail-Sleuth. Most of the usual 
FCA line diagram labeling conventions are followed with the exception of iconifying 
vertices with an envelope. There is no obvious “search point” (meaning no clear starting 
place to commence the search) and limited visual highlighting in the diagram itself. 
Structural diagrammatic constraints are imposed; concepts cannot be moved above 
their superconcepts, for instance. 

“.. get rid of the grey blobs...” [User 2] 

Because we are dealing with objects that are emails and not ball bearings 
it was natural to replace the stylized vertices (a legacy of the Hasse diagram) 
with a literal iconic representation relevant to the domain. In the case where 
“Derived Folders” were unrealized, no vertex is drawn; where data is present 
an envelope replaces the envelope/ball icon combination as shown in Fig. 1. Top 
and (empty) bottom vertices appear at most once in a line diagram and so 
are removed from the legend (shown in the legend of Fig. 4 but not in Fig. 1) 
and labeled accordingly in the diagram itself (shown in Fig. 1). The ability to 
manipulate the line diagram in four directions via the “Pan” widget appears in 
Fig. 1, and the envelopes animate by “appearing to lift” on rollover; the drop 
shadowing helps suggest that vertices in the line diagram can be moved and 
therefore manually adjusted by the user. 

Fig. 7. Mail-Sleuth tries to accommodate a new user community to document 
browsing using FCA. Hiermail (shown above) has a much stronger conformity to 
diagrammatic traditions in FCA. It is effectively a version of Mail-Sleuth with a 
ToscanaJ-like skin. 

“Hide the lattice-work where no relationships exists. ” [User 6] 

Edge highlighting has been used to emphasize relationships in line diagrams in 
both ToscanaJ and in Cem. This idea is mainly used as a method to orient the 
current vertex in the overall line diagram so that relationships can be identified. 
ToscanaJ allows the edges of the line diagram to be labeled with the ratio 
of object counts to approximate the idea of “support” in data mining. That 
program also uses the size of the extent to determine the color of a vertex. A 
number of other significant functions for listing, averaging and visualizing the 
extent at a vertex are also provided by ToscanaJ. 

Trying to create a new user community with Mail-Sleuth is an interesting 
exercise but the original user community also requires attention. Hiermail 9 is 
a version of Mail-Sleuth for the FCA community that conforms to the 
diagrammatic conventions of ToscanaJ. It took only a matter of days to roll back 
the design lessons learned from over four months of usability testing and design 
refinement with Mail-Sleuth to produce Hiermail as shown in Fig. 7. 

9 http://www.hiermail.com 

6 Conclusion 

This paper canvasses a number of usability and visualization issues for interpret- 
ing and understanding line diagrams based on the design experience gained from 
testing Mail-Sleuth. It tests, on a very small scale, the ability of novices to 
FCA to understand line diagrams. The results are promising and indicate that 
novice users can read and interpret line diagrams. 

The design choices made do not represent the only possibilities for helping 
novice users understand lattice diagrams but rather are determined by 
constraints on time, resources and programming utilities available to the 
Mail-Sleuth platform. Nonetheless, the choices were suitably tested and resulted in 
promising outcomes: users untrained in FCA were able to read and interpret 
line diagrams, and this discovery argues for a less complex task-flow for 
domain-specific applications such as Mail-Sleuth. Naturally, “deep knowledge” can 
only be gained by complex scale interaction and there is very little of that in 
Mail-Sleuth at its present stage of development. The test candidates were only 
confronted with small diagrams, direct products of chains, and did not zoom into 
vertices (although that functionality exists in Mail-Sleuth). 

The results are therefore preliminary and anecdotal but the methodology 
followed is a limited example of user-centered design based on a case-study. 
Some aspects of the presentation of line diagrams are impossible to adjust but 
other devices can be introduced to give visual clues on the correct meaning of 
the diagram and its interactivity. In the process of experimenting with these 
ideas this paper catalogs the design evolution of the Mail-Sleuth program. A 
much larger usability test needs to be undertaken to verify the findings. These 
early results are however promising indicators that novice users can read line 
diagrams. 

Abstract. This paper presents the JBraindead Information Retrieval System, 
which combines a free-text search engine with online Formal Concept Analysis 
to organize the results of a query. Unlike most applications of Conceptual Clus- 
tering to Information Retrieval, JBraindead is not restricted to specific domains, 
and does not use manually assigned descriptors for documents nor domain spe- 
cific thesauruses. Given the ranked list of documents from a search, the system 
dynamically decides which are the most appropriate attributes for the set of 
documents and generates a conceptual lattice on the fly. This paper focuses on 
the automatic selection of attributes: first, we propose a number of measures to 
evaluate the quality of a conceptual lattice for the task, and then we use the pro- 
posed measures to compare a number of strategies for the automatic selection of 
attributes. The results show that conceptual lattices can be very useful to group 
relevant information in free-text search tasks. The best results are obtained with 
a weighting formula based on the automatic extraction of terminology for the- 
saurus building, as compared to an Okapi weighting formula. 



1 Motivation 

Clustering techniques, which are a classic Information Retrieval (IR) technique, are 
only now becoming an advanced feature of Web search engines. Vivisimo 
(www.vivisimo.com), for instance, performs a standard web search and then provides 
a hierarchical clustering of the search results in which natural language expressions 
label each node. A search for “jaguar” with Vivisimo automatically displays 
a taxonomy of results with nodes such as “Car” (referring to the Jaguar car 
brand), “Mac OS X” (also known as Jaguar), or “animal”, which permit a fast 
refinement of the search results according to the user’s needs. If the user, for instance, 
expands the node “animal”, results are classified in four subclusters, namely “wild 
life”, “park zoo”, “adventure amazon” and “other topics”. Other examples of cluster- 
ing techniques in the web include the Altavista (www.altavista.com) search engine, 
which displays a set of suggestions for query refinement that produce a similar clus- 
tering effect, and the Google news service (news.google.com), where news from hun- 
dreds of web servers are automatically grouped into uniform topics. 

A common feature of such web services is that clustering is applied to a small set 
of documents, which come as a result of a query (in search engines) or filtering 
profile. At this level, clustering proves to be an enabling search technology halfway 
between browsing (as in web directories, e.g. Yahoo.com or dmoz.org) and querying (as 
in Google or Altavista). Pure browsing is useful for casual inspection/navigation (i.e., 
when the information need is vague), and querying is useful when the information 
need is precise (e.g. I am looking for a certain web site). Probably the majority of web 
searches lie somewhere between these two kinds of search needs, and hence the bene- 
fits of clustering may have a substantial impact on user satisfaction. 

A natural question arises: can Formal Concept Analysis (FCA) be applied to 
browse search results in a free text IR system? FCA is a conceptual clustering tech- 
nique that has some advantages over standard document clustering algorithms: a) it 
provides an intensional description of each cluster, which makes groupings more 
interpretable, and b) cluster organization is a lattice, rather than a hierarchy, facilitat- 
ing recovery from bad decisions while exploring the hierarchy and, in general, provid- 
ing a richer and more flexible way of browsing the document space than hierarchical 
clustering. 

The idea of applying FCA only to a small subset of the document space (in our 
case, the results of a search) eliminates some of the problems associated with the use 
of FCA in Information Retrieval: 

• FCA is computationally more costly than standard clustering, but both can be 
equally applied to small sets of documents (in the range of 50-500) efficiently 
enough for online applications. 

• Lattices generated by FCA can be big, complex and hence difficult to use for prac- 
tical browsing purposes. In particular, it can produce unmanageable structures 
when applied to large document collections and rich sets of indexing terms. Again, 
this should not be a critical problem when the set of documents is restricted in size 
and topic by a previous search over the full document collection. 

But clustering the results of a free text search is not a straightforward application of 
FCA. Most Information Retrieval applications of FCA are domain-specific, and rely 
on thesauruses or (usually hierarchical) sets of keywords which cover the domain and 
are manually assigned as document descriptors (see section on Related Work). Is it 
viable and useful to apply FCA without such manually built knowledge? 

The JBraindead system, which combines free-text searching with FCA on search 
results, is a prototype Information Retrieval system that serves as a testbed to 
investigate this research question. In this paper, we focus on the first essential 
aspect of the application of FCA to free-text searching: what is the optimal strategy 
for the automatic selection of document attributes? 

In order to answer this question, we first need to define appropriate evaluation met- 
rics for conceptual lattices in free-text search tasks, and then compare alternative 
attribute selection strategies with such metrics. Therefore, we will start by defining 
two different metrics related to the user task of finding relevant information: 1) a 
lattice distillation factor measuring how well the document clusters in the lattice 
prevent the user from accessing irrelevant documents (compared to the original 
ranked list returned by the search engine), and 2) a lattice browsing complexity meas- 
uring how many node descriptions have to be examined to reach all the relevant in- 
formation. An optimal lattice will have a high distillation factor and a low browsing 
complexity. With these measures, we will compare two different attribute selection 
criteria: a standard IR weight (Okapi) measuring the discriminative importance of a 
term with respect to the collection, and a “terminological weight” measuring the 
adequacy of a term to represent the content of a retrieved subset as compared to the 
full collection being searched. We will also study what number of attributes is 
adequate to build an optimal conceptual lattice for this kind of task. 

The paper is organized as follows: Section 2 provides a brief description of the 
functionality and architecture of the JBraindead system. Section 3 summarizes the 
experimental setup and the results obtained; Section 4 reviews related work, and Sec- 
tion 5 offers some conclusions and discusses future work. 



2 The JBraindead Information Retrieval and Clustering System 

JBraindead is a prototype IR system that applies Formal Concept Analysis to organize 
and cluster the documents retrieved by a user query: 

1. Free-text documents and queries are indexed and compared in a vector space 
model, using standard tf*idf weights. For a given query, a ranked list of documents 
is retrieved using this model. 

2. The first n documents in the ranked list are examined to extract, from the terms in 
the documents, a set of k optimal descriptors according to some relevance weight- 
ing formula. 

3. Formal Concept Analysis is applied to the set of documents (as formal objects), 
where the formal attributes of each document are the subset of the k descriptors 
which are contained in its text. 

4. Besides the intensional characterization of each concept node, an additional 
description is built with the most salient phrasal expressions including one or more 
query terms. This additional characterization is intended to enhance node 
descriptions for the query-oriented browsing role that conceptual lattices play in 
JBraindead. 

5. The resulting annotated lattice is presented to the user, who can browse the top n 
results by traversing the lattice and/or refine the query at some point. In its current 
implementation, query refinement can only be made as a direct query 
reformulation. 

The core of the process lies in steps 2 and 4. Step 2 determines the attribute set for 
a given document set, and then, implicitly, also defines the conceptual lattice. Step 4 
enriches the intensional description of concept nodes with query-related phrases, 
defining how the lattice nodes will be presented to the user. 
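
Steps 2 and 3 of the process can be illustrated with a small Python sketch. It is not the JBraindead implementation: the function names, the toy documents and the stand-in weight (document frequency within the retrieved set) are all assumptions made for illustration, and the concept enumeration is a naive closure over attribute subsets that is only practical for a small number k of descriptors.

```python
from itertools import combinations


def select_descriptors(docs, k, weight):
    """Step 2: keep the k terms with the highest weight (ties broken by name)."""
    terms = set().union(*docs.values())
    return sorted(terms, key=lambda t: (-weight(t), t))[:k]


def build_context(docs, descriptors):
    """Step 3: a document's attributes are the descriptors found in its text."""
    return {d: frozenset(t for t in descriptors if t in terms)
            for d, terms in docs.items()}


def formal_concepts(context, descriptors):
    """Enumerate (extent, intent) pairs by closing every attribute subset.

    Exponential in the number of descriptors, so only suitable for small k.
    """
    found = set()
    for size in range(len(descriptors) + 1):
        for attrs in combinations(descriptors, size):
            extent = frozenset(d for d, a in context.items()
                               if a.issuperset(attrs))
            intent = frozenset(t for t in descriptors
                               if all(t in context[d] for d in extent))
            found.add((extent, intent))
    return found


# Toy retrieved set (hypothetical): document id -> set of terms in its text.
docs = {
    "d1": {"pesticide", "food", "baby"},
    "d2": {"pesticide", "lindane"},
    "d3": {"food", "baby", "jar"},
}


def doc_freq(term):
    """Stand-in weight: document frequency within the retrieved set."""
    return sum(term in terms for terms in docs.values())


descriptors = select_descriptors(docs, k=3, weight=doc_freq)
context = build_context(docs, descriptors)
concepts = formal_concepts(context, descriptors)
for extent, intent in sorted(concepts, key=lambda c: -len(c[0])):
    print(sorted(extent), sorted(intent))
```

Passing the weight as a function keeps the sketch neutral about which weighting formula is used, since the choice of formula is what the experiments compare.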

Figure 1 shows the JBraindead interface for the results of the query “pesticidas en 
alimentos para bebés” (pesticides in baby food) when searching the standard CLEF 
collection of news in Spanish (see next section for details). The terms “pesticidas”, 
“alimentos” and “bebé” all appear as second-level nodes in the conceptual lattice. Other 
attributes automatically selected by JBraindead are “potitos” (a kind of baby food 
which happened to contain pesticides), “lindano” (the kind of toxic waste 
found in the baby food), “Hero” (a baby food brand), or “tóxicos” (toxic). JBraindead 
also extracts complex node descriptions including node attributes, such as “alimentos 
para bebés” (baby food) or “agricultura y alimentación” (food and agriculture). 

Fig. 1. JBraindead system results for the query “pesticidas en alimentos para bebés” 
(pesticides in baby food) and the CLEF EFE 1994 news collection. 



3 Experiments in Attribute Selection 

This section describes a set of experiments designed to automatically find optimal 
attributes for the set of documents retrieved for a given query in the JBraindead 
system. We describe a) the Information Retrieval testbed used in our experiments, b) a 
set of measures proposed to evaluate the quality of conceptual lattices for the pur- 
poses of grouping free-text search results, c) the experiments carried out, and, finally, 
we discuss the results obtained. 



3.1 Information Retrieval Testbed 

A manual, qualitative inspection of the lattices generated by JBraindead on the results 
of random queries can provide an initial feedback on the quality of the process. But 
for a systematic comparison of approaches, an objective measure is needed. While the 
final purpose of the system is to improve user searches, studies involving users are 
costly and should only be performed for final testing of already optimized 
alternatives; hence, we wanted to find an initial experimental setup in which we could tune 
the process of selecting document attributes before performing user studies. 

The Spanish CLEF EFE-1994 collection that we have used during system devel- 
opment includes a set of 160 TREC-like topics (used in CLEF 2001, 2002 and 2003 
evaluation campaigns) with manual relevance assessments from a rich and stable 
document pool [13], forming a reliable and stable test bed for document retrieval 
systems. Out of this set, we have used topics 41-87 coming from the CLEF 2001 and 
2002 campaigns. 

If we feed JBraindead with CLEF topics, we can study how the conceptual lattices 
group relevant and non relevant documents. The baseline is the ranked list of docu- 
ments retrieved by the initial search: to discover all relevant documents in the set, the 
user would have to scan at least all documents until the last relevant document is 
identified. If the FCA process groups relevant documents and the node descriptions are 
useful indicators of content, then browsing the lattice for relevant documents could 
provide the same recall while scanning only a fraction of the initial set of retrieved 
documents, saving the user time and offering a structured view of the different 
subtopics among relevant documents. 



3.2 Evaluation Measures 

How can we measure quantitatively the ability of the FCA process to group relevant 
documents together? Two standard clustering measures are purity and inverse 
purity. Given a manual classification of the documents into a set of labels, the 
precision of a cluster P with respect to a label partition L (containing all documents 
assigned to the label) is the fraction of documents in P which belong to L. The purity 
of the clustering is then defined as the (weighted) average of the maximal precision 
values of each cluster P, and the inverse purity is defined as the weighted average of 
the maximal precision values of each partition L over the clusters. Purity achieves a 
maximal value of 1 when every cluster has one single document, and inverse purity 
achieves a maximal value of 1 when there is only one single cluster. 
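
The two measures, and the degenerate cases in which each reaches 1, can be checked with a short Python sketch. The function names and the four-document example are illustrative assumptions; the sketch also assumes that clusters are non-empty and that every document belongs to exactly one cluster.

```python
from collections import Counter


def purity(clusters, labels):
    """Weighted average over clusters of the best precision against any label.

    Weighting each cluster by its size cancels the denominator of the
    precision, so this is (sum over clusters of the largest label count) / n.
    """
    n = sum(len(c) for c in clusters)
    return sum(max(Counter(labels[d] for d in c).values()) for c in clusters) / n


def inverse_purity(clusters, labels):
    """Same computation with the roles of clusters and label partitions swapped."""
    by_label = {}
    for doc, label in labels.items():
        by_label.setdefault(label, set()).add(doc)
    cluster_of = {doc: i for i, c in enumerate(clusters) for doc in c}
    return purity(list(by_label.values()), cluster_of)


# Hypothetical data: four documents, two relevant and two non-relevant.
labels = {1: "rel", 2: "rel", 3: "non", 4: "non"}
singletons = [{1}, {2}, {3}, {4}]
one_cluster = [{1, 2, 3, 4}]

print(purity(singletons, labels), inverse_purity(singletons, labels))
print(purity(one_cluster, labels), inverse_purity(one_cluster, labels))
```

Both trivial clusterings maximize one of the two measures while being useless for browsing, which is part of why neither measure alone is informative.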

Purity and inverse purity are, then, inadequate measures for the conceptual cluster- 
ing generated by FCA: the cluster structure of a conceptual lattice is much richer than 
a plain set of labels; and, in addition, the only distinction that we can make for this 
experiment is between relevant and non relevant documents. What we want to meas- 
ure is whether the lattice structure effectively “distillates” relevant documents to- 
gether, allowing the user to locate relevant information better and faster than in a 
ranked list of documents. Hence we introduce here a “lattice distillation factor” 
measure, which relies on a notion of minimal browsing area that we introduce now. 

3.2.1 Lattice Distillation Factor 

Let C be the set of nodes in a conceptual lattice, where documents are all marked as 
relevant or non-relevant for a given query. Let us assume that, when visiting a node, 
the user sees the documents for which the node is their object concept. We will use 
the term relevant concept to denote object concepts generated by, at least, one 
relevant document, and irrelevant concept to denote object concepts generated only by 
one or more irrelevant documents. 

We define $C_{REL} \subseteq C$ as the subset of relevant concepts in the lattice. In 
order to find all relevant documents displayed in the lattice, the user has to examine, 
at least, the contents of all concepts in $C_{REL}$. We define the minimal browsing area 
(MBA) as the minimal part of the lattice that a user should explore, starting from the 
top node, to reach all the relevant concepts of $C_{REL}$, minimizing the number of 
irrelevant documents that have to be inspected to obtain all the relevant information. 
We can think of the precision of the MBA (ratio between relevant documents and overall 
number of documents in the MBA) as an upper bound on the capacity of the lattice to 
“distillate” relevant information from the search results. The lower bound is the 
precision of the original list: the user has to scan all documents retrieved to be sure 
that no relevant document is being missed from that list. 

The lattice distillation factor (LDF) can then be defined as the potential precision 
gain between the lattice and the ranked list, i.e., as the percentage precision gain 
between the minimal browsing area (MBA) and the original ranked list (RL): 

$$LDF(C) = \frac{Precision_{MBA} - Precision_{RL}}{Precision_{RL}} \times 100 \qquad (1)$$



Note that the minimal browsing area and the distillation factor can be equally ap- 
plied to hierarchical clusters or any other graph grouping search results. 
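
As a quick numerical check of formula (1), the following Python sketch computes the distillation factor from document counts. The function names are illustrative assumptions; the counts reproduce the worked example of Fig. 3, where the minimal browsing area holds 4 relevant documents out of 5 and the ranked list holds 4 relevant documents out of 7.

```python
def precision(relevant, irrelevant):
    """Fraction of the inspected documents that are relevant."""
    return relevant / (relevant + irrelevant)


def lattice_distillation_factor(precision_mba, precision_rl):
    """Percentage precision gain of the minimal browsing area over the ranked list."""
    return (precision_mba - precision_rl) / precision_rl * 100


precision_mba = precision(relevant=4, irrelevant=1)  # 4/5
precision_rl = precision(relevant=4, irrelevant=3)   # 4/7
ldf = lattice_distillation_factor(precision_mba, precision_rl)
print(round(ldf, 1))
```

When the minimal browsing area contains every retrieved document, both precisions coincide and the factor is 0, which corresponds to the lower bound discussed above.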

The only difficulty in calculating the distillation factor lies in how to find the 
minimal browsing area for a given lattice. In order to calculate this area, we will 
create an associated graph where all nodes are relevant concepts, and where the cost 
associated to each arc is related to the number of irrelevant documents which will be 
accessed when traversing the arc. Then we will calculate the minimal spanning tree for 
such a graph, which will give the minimal browsing area: 

1. We start with the original lattice (or any directed acyclic graph). We define the cost 
of any arc reaching a relevant or irrelevant concept node, from one of its upper 
neighbors, as the number of irrelevant documents that are fully characterized by 
the node. E.g., if we have an object concept c such that $\gamma d_1 = \gamma d_2 = \gamma d_3 = c$, 
where $d_1$ and $d_2$ are non-relevant documents, all arcs reaching c will have a cost of 2. 

2. In a top-down iterative process, we will suppress all nodes which are not relevant 
concepts. In each iteration, we select the node j which is closest to the top and is 
not a relevant concept. j is deleted and, to keep connections between ancestors and 
descendants of the node, we create a new arc for every pair of nodes 
$(u,l) \in U_j \times L_j$, where $U_j$ and $L_j$ are the sets of upper and lower neighbors 
of j. A cost of cost(u,l) = cost(u,j) + cost(j,l) is then assigned to the new arc. If we 
end up with more than one arc for a single pair of nodes (u,l), we select the arc with 
the lowest cost and suppress the others. 

3. The result of the iteration above is a directed acyclic graph whose nodes are all 
relevant concepts. The minimal spanning tree of this new graph tells us which is the 
minimal browsing area in the original lattice. 
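
The three steps above can be sketched in Python as follows. This is an illustrative reading of the procedure, not the authors' code, and it makes several assumptions that are not stated in the text: the lattice is given as a map from each node to its lower neighbors, the top node is always kept as the root even when it is not a relevant concept, irrelevant documents attached to the top are counted as seen, and the minimal spanning tree of the resulting directed acyclic graph is obtained by choosing the cheapest incoming arc for every remaining node. The small lattice in the example is hypothetical.

```python
def minimal_browsing_area(lower, relevant, irrelevant, top):
    """Return (tree_arcs, irrelevant_seen) for a lattice given as a DAG.

    lower:      node -> list of lower neighbours (cover relation, top-down)
    relevant:   node -> number of relevant documents whose object concept it is
    irrelevant: node -> number of irrelevant documents whose object concept it is
    """
    nodes = set(lower) | {v for vs in lower.values() for v in vs}

    # Step 1: the cost of an arc is the number of irrelevant documents
    # attached to the node the arc reaches.
    cost = {(u, v): irrelevant.get(v, 0) for u in lower for v in lower[u]}

    # Top-down (topological) order of the nodes, via Kahn's algorithm.
    indegree = {n: 0 for n in nodes}
    for _, v in cost:
        indegree[v] += 1
    order, ready = [], [n for n in nodes if indegree[n] == 0]
    while ready:
        n = ready.pop()
        order.append(n)
        for v in lower.get(n, ()):
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    # Step 2: suppress every node that is not a relevant concept, bridging
    # its upper and lower neighbours and keeping the cheapest parallel arc.
    for j in order:
        if j == top or relevant.get(j, 0) > 0:
            continue
        ups = {u: c for (u, v), c in cost.items() if v == j}
        downs = {l: c for (u, l), c in cost.items() if u == j}
        cost = {arc: c for arc, c in cost.items() if j not in arc}
        for u, cu in ups.items():
            for l, cl in downs.items():
                cost[(u, l)] = min(cost.get((u, l), cu + cl), cu + cl)

    # Step 3: in a DAG with a single root, taking the cheapest incoming arc
    # of each node yields a minimum-cost spanning tree rooted at the top.
    best = {}
    for (u, v), c in cost.items():
        if v not in best or c < best[v][1]:
            best[v] = (u, c)
    tree = {(u, v) for v, (u, _) in best.items()}
    seen = irrelevant.get(top, 0) + sum(c for _, c in best.values())
    return tree, seen


# Hypothetical lattice: 4 relevant and 3 irrelevant documents in total.
lower = {"top": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["bottom"], "bottom": []}
relevant = {"A": 2, "C": 2}
irrelevant = {"B": 2, "C": 1}

tree, seen = minimal_browsing_area(lower, relevant, irrelevant, "top")
total_relevant = sum(relevant.values())
precision_mba = total_relevant / (total_relevant + seen)
print(sorted(tree), seen, precision_mba)
```

In this example the node B, which holds only irrelevant documents, is bypassed, so the user reads one irrelevant document instead of three.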

Figure 3 shows an example of how to build the minimal browsing area and 
calculate the lattice distillation factor. 

3.2.2 Lattice Browsing Complexity 

The distillation factor is only concerned with the cost of reading documents. But 
browsing a conceptual structure has the additional cost (as compared to a ranked list 
of documents) of examining node descriptions and deciding whether each node is 
worth exploring. For instance, a lattice may lead us to ten relevant documents and 
save us from reading another ten irrelevant ones... but force us to traverse a thousand 
nodes to find the relevant information! Therefore, the number of nodes in the lattice 
has to be considered to measure its adequacy for searching purposes. 

It may also be the case that a lattice has a high distillation factor but a poor 
clustering, forcing the user to consider most of the nodes in the structure. An example 
can be seen in Figure 2, where all the object concepts occur near the lattice bottom. 
Precision for the minimal browsing area is 1, and the lattice distillation factor is 
100%. The clustering, however, is not good: the user has to consider (if not explore) 
all node descriptions to decide where the relevant information is located. 




Fig. 2. This lattice has a high distillation factor (LDF = 100%), but the clustering is poor. 

We need to introduce, then, another measure estimating the percentage of nodes 
that must be considered (rather than visited) in a lattice in order to reach all relevant 
information. We propose a measure of the lattice browsing complexity (LBC) as the 
proportion of nodes in the lattice that the user sees when traversing the minimal 
browsing area. The idea is that, when a node is explored, all its lower neighbors have 
to be considered, while only some of them will be in turn explored. 

With C being the set of nodes in the concept lattice, the set of viewed nodes 
$C_{VIEW}$ is formed by the lower neighbors of each node belonging to the minimal 
browsing area. The lattice browsing complexity is the percentage of lattice nodes that 
belong to $C_{VIEW}$: $LBC(C) = |C_{VIEW}| / |C| \times 100$. Figure 4 shows an example 
of how the lattice browsing complexity is calculated. 
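
The definition translates directly into a few lines of Python. The sketch below is an illustration under stated assumptions: the lattice is given as a map from each node to its lower neighbors, the nodes of the minimal browsing area are supplied by the caller, and the five-node lattice is hypothetical. Following the definition literally, only lower neighbors of MBA nodes count as viewed, so the top node itself is not counted.

```python
def lattice_browsing_complexity(lower, mba_nodes):
    """Percentage of lattice nodes that are lower neighbours of an MBA node."""
    nodes = set(lower) | {v for vs in lower.values() for v in vs}
    viewed = {v for n in mba_nodes for v in lower.get(n, ())}
    return 100 * len(viewed) / len(nodes)


# Hypothetical lattice whose minimal browsing area is top -> A -> C.
lower = {"top": ["A", "B"], "A": ["C"], "B": ["C"], "C": ["bottom"], "bottom": []}
lbc = lattice_browsing_complexity(lower, mba_nodes={"top", "A", "C"})
print(lbc)
```

Node B is never explored here, yet it still counts as viewed because its description has to be read when the top node is expanded.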

3.3 Experiments 

We have used CLEF topics 41-87, corresponding to the CLEF 2001 campaign. For 
each experiment, all topics (title+description) are searched. For every search, a formal 
context $K = (G, M, I)$ is built, where G is the set of the first 100 documents returned 
by the search engine in response to a query, M is the set of attributes (variable between 
experiments), and $d \, I \, t$ iff the attribute t is a term occurring in document d. 

The two weighting measures used to generate the formal contexts are the Okapi 
weighting scheme and the terminological formula proposed in [12], 
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LDF = (4/5-4/7V(4/7)*1 00=40% 



Fig. 3. Calculation of the Lattice Distillation Factor. 
Fig. 4. Concept Lattice with LBC = 61%. MBA links are represented as continuous lines on the
concept lattice. LBC nodes are drawn as oval nodes. Circular nodes represent nodes which are 
not seen when traversing the MBA. 



3.3.1 Okapi 

Term weights in an IR system measure the importance of a term as a discriminative 
descriptor for a given document. We have selected the Okapi BM25 weight, which 
has given the best results for the CLEF collection used in our experiments [15]: 
where the parameters k_1, b and avdl are adjusted for the Spanish CLEF test collection
with the values b = 0.5, k_1 = 1.2, and avdl = 300 taken from [15]. tf_i represents the
frequency of the term i in the document (in our case, in the set of retrieved documents),
and f_i the document frequency of term i in the whole collection. Finally, l_ret
represents the total length (in terms) of the retrieved document set.



3.3.2 Terminological Weight Formula 

A terminological weight is designed to find, in a collection that is representative
of some specific domain, which terms are most suitable as descriptors for a thesaurus
of the domain. We use a variation of a formula introduced in [12] which compares
the domain-specific collection with a collection from a different domain, and assigns
a higher weight to terms that are more frequent in the domain-specific collection than
in the contrastive collection. In our case, the domain-specific collection is the set
of retrieved documents, and the contrastive collection is the whole document collection
minus the retrieved set:
where w_i is the terminological weight of term i, tf_{i,ret} is the relative frequency of term i
in the retrieved document set, f_{i,ret} is the retrieved set document frequency of term i,
and tf_{i,col} is the relative frequency of term i in the whole collection minus the retrieved
set.



3.3.3 Number of Attributes 

Finally, for the formulas above we have studied how the number of attributes affects
the size and quality of the conceptual lattice. After some initial testing, we have kept
the number of attributes between 10 and 20: with fewer attributes, the clustering capacity
is too low, and with more than 20 attributes, the number of nodes becomes too
high for browsing purposes, and the computational cost makes online calculation
too slow.



3.4 Results 

Table 1 shows the basic outcome of our experiments. In all cases, the Distillation 
Factor is high, ranging from 346% to 594%. Note that this measure is an upper bound 
on the behavior of real users: only an optimal traversing of the lattice will give such 
relative precision gains. Note also that the microaveraged LDF is much higher than 
would result from the average precisions of the ranked list and minimal browsing 
area. This is because the LDF expresses the relative precision gain rather than the 
absolute precision gain. 
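The effect can be checked with a small calculation. The sketch below assumes the LDF is the relative precision gain of the minimal browsing area over the ranked list, as in the worked example of Fig. 3; the two per-topic precision pairs are hypothetical and only illustrate why averaging per-topic LDF values gives a larger figure than computing one LDF from averaged precisions.

```python
def ldf(prec_mba, prec_ranked):
    """Relative precision gain (in percent) of the MBA over the ranked list."""
    return 100.0 * (prec_mba - prec_ranked) / prec_ranked


# Hypothetical per-topic precisions: (ranked list, minimal browsing area).
topics = [(0.05, 0.40), (0.30, 0.45)]

# Average of the per-topic LDF values.
mean_of_ldfs = sum(ldf(mba, ranked) for ranked, mba in topics) / len(topics)

# LDF computed from the averaged precisions.
mean_ranked = sum(ranked for ranked, _ in topics) / len(topics)
mean_mba = sum(mba for _, mba in topics) / len(topics)
ldf_of_means = ldf(mean_mba, mean_ranked)
```

Topics with a very low ranked-list precision contribute very large relative gains, which dominate the average of the per-topic values.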

For 10 attribute terms, the Okapi formula gives a higher Distillation Factor (580%
versus 346%) but at the cost of a much larger lattice (70 nodes on average versus 35
nodes with the terminological formula). Both differences are statistically significant
according to a paired t-test (p<0.05). In practice, the Okapi formula generates lattices
that are too large for browsing a hundred documents, hence the terminological formula
should give better results in experiments involving users.

The LDF seems to grow linearly with the number of attributes, and the complexity
factor seems to decay linearly with the number of attributes. The number of nodes,
however, grows almost exponentially. For 15 terms, the number of nodes generated by the
terminological formula is already too large (94 nodes) for practical purposes.

Overall, it seems clear that conceptual lattices can be very effective for grouping relevant
information, and the grouping effect is higher for larger attribute spaces. But the
number of nodes quickly becomes impractical. From this point of view, understanding
attributes as potential terminological units seems to give more compact lattices than
seeing attributes as IR indexing terms.



Table 1. Experimental results. The average precision of the original ranked lists was 0.17. 
Weighting        # Attributes  Prec. MBA  LDF    # Nodes  # Nodes Viewed MBA  LBC
Terminological   10            0.35       346%   35       19                  54%
Terminological   15            0.43       493%   94       50                  43%
Terminological   20            0.52       594%   184      65                  36%
Okapi            10            0.43       580%   70       32                  44%

4 Related Work

The application of FCA to Information Retrieval is an increasingly successful field,
which has already produced some commercial applications, although all research
known to us concentrates on collections that are manually (or semi-automatically)
indexed or classified according to some domain-specific thesaurus or classification scheme.

Two early applications for which empirical tests with users were conducted are
[11] and [1]. In [11], navigation in a Galois lattice is compared to boolean retrieval
and hierarchical navigation in an Information Retrieval task involving users. In the
experiment, recall obtained using lattices and boolean retrieval is superior to navigation
in a hierarchy. The document collection consisted of 113 short animation film
descriptions, and every document was manually indexed by an average of 6.53 classification
terms. In [1], a lattice conceptual clustering system is proposed that incorporates
background knowledge from the indexing thesaurus (i.e. the broader/narrower
term relationships in the thesaurus) into the process of building the conceptual clustering
lattice. Browsing with and without the background knowledge were compared in
the context of user searches against a collection of 1555 documents about Artificial
Intelligence extracted from a computer engineering collection (INSPEC). Browsing
with background knowledge led to a 30% relative improvement in recall, showing that
the incorporation of specificity relations between indexing terms is a significant improvement
over building the lattice without considering the relations between the
keywords manually assigned to documents.

One of the application domains that has received most attention is medical documentation.
In [3], a set of 9000 patient medical discharge summaries are indexed
using SNOMED (Systematized Nomenclature of Medicine), showing the viability of
the approach. The approach has a continuation in [4,2] and [5], where documents are
automatically indexed using UMLS (Unified Medical Language System) metathesaurus
terms, and the notions of conceptual scales and purified contexts are introduced
for improved, scalable knowledge visualization. Unfortunately, to our knowledge, no
empirical, quantitative evaluations or user studies have been conducted in this domain.

FCA has also been applied to document retrieval in conjunction with faceted the- 
sauruses, a notion which is related to conceptual scales, in which different aspects of 
an article description (for instance, the topic of an article and the level of difficulty) 
have descriptors in different facets of the thesaurus. The IR system FaIR [14] is an 
example of such a system, which is applied to an on-line collection of about 5000 
FAQ documents of computing questions. Another application in the computer domain 
is Aran [9], an Information Help System that applies FCA to Unix man pages. A 
characteristic feature of this system is that it does not employ any prior thesaurus; 
indexes are obtained from free text in the short (one or two lines) command descriptions
that summarize every Unix command. As in the medical domain, none of these
systems have been quantitatively evaluated.

An attractive example of the possibilities of FCA for knowledge management is 
the HIERMAIL system [8,6], which provides a structured ontology and IR system for 
Email search and discovery, in which the principles of FCA are supported by an in- 
verted file index that provides efficient client iteration. Although there is no empirical 
evaluation of the utility of the system (perhaps because it is not trivial to design an
evaluation for knowledge management tasks), indirect evidence of its value is that
the idea of applying FCA to e-mail management has already reached the market with
the Mail-Sleuth application (http://mail-sleuth.com).

More recently, [7] combine Information Extraction on web documents with FCA
in an information access application on the domain of classified advertisements for 
Real Estate properties. Rather than simply extracting keywords from documents, the 
Information Extraction process extracts template-based data to describe advertise- 
ments, improving the input to the FCA process. 

Finally, a work which is similar in spirit to the JBraindead approach is described in 
[10]. The authors cluster a news collection (Reuters-21578) combining a standard 
clustering technique, which is applied to the whole collection, with FCA, which is 
applied individually to every cluster produced in the first process. One of the salient 
features of the system is that they use a general purpose lexical knowledge base 
(WordNet) rather than a domain specific thesaurus as background knowledge, both 
for the initial clustering process and for the subsequent Conceptual Clustering step. In 
practice, that means that the input for FCA is closer to free indexing terms than in
any of the applications mentioned above. JBraindead uses a similar approach, but in
an IR application: Hotho and Stumme apply FCA to smaller subsets of the collection
by applying a standard clustering technique, and then performing FCA on every cluster
returned; JBraindead applies FCA to smaller subsets of the collection by applying
standard IR, and then performing FCA online on the results of the search. It is worth
mentioning that in Hotho and Stumme's work, the indexes for the FCA process are
the terms with higher values in the centroid vector representing the cluster. The combination
of WordNet and centroid vectors is an interesting alternative to the methods
evaluated in this paper, which we seek to adapt and compare with our current keyword
extraction procedures.



5 Conclusions and Future Work 

We have described the JBraindead Information Retrieval system, which combines 
standard IR techniques with online conceptual clustering applied on the results of the 
initial user query. The system is domain independent and operates without resorting to 
thesauruses or other predefined sets of indexing terms. Hence, the contributions of 
JBraindead to the application of FCA in Information Retrieval lie in the approaches
to extract indexing terms for the FCA process and to build natural descriptions of the 
nodes in the resulting lattice. 

In this paper we have focused on the process of attribute selection, comparing two
weighting schemes of different nature: the Okapi probabilistic weights, related to the
discriminative power of a term for IR purposes, and a terminological weight related to
the adequacy of a term as a topic-specific descriptor. We have also measured the influence
of the number of attributes on the quality of the resulting lattice for searching
purposes. We have placed special emphasis on the definition of metrics to compare
different conceptual structures for the task of browsing free-text results, introducing:
a) a lattice distillation factor, related to how well the conceptual structure prevents the 
user from reading irrelevant documents, and b) a lattice browsing complexity, related 
to the proportion of nodes in the structure that have to be considered to reach all rele- 
vant information. An optimal lattice will have a high distillation factor, a low brows- 
ing complexity and a low number of nodes. 
The results show that the terminological weighting is better than the IR Okapi
weight, and that an increasing number of attributes improves the distillation factor at
the cost of a much larger lattice: the relative browsing complexity decreases, but the
number of nodes that have to be considered grows quickly. Most differences between runs
are statistically significant, showing that the quality of the conceptual structures is
highly sensitive to parameter settings.

The JBraindead system illustrates the scalability of FCA to unrestricted Informa- 
tion Retrieval settings, if it is applied to organize search results, rather than trying to 
structure the whole document collection with conceptual analysis. To our knowledge, 
this is the first IR system based on FCA that operates on a collection of more than 500 
Mb comprising more than 200,000 documents. 

JBraindead provides, as well, a test bed to study optimal querying, indexing, visu- 
alization and refinement strategies for free-text retrieval based on conceptual cluster- 
ing. The experiments reported here are just a first step towards optimal, interactive 
content retrieval and browsing. We are currently experimenting with shallow Infor- 
mation Extraction techniques (named entity recognition, noun phrase indexing) to 
reach a selection of terms that can be used both to produce better lattice structures and 
as natural descriptors of nodes. 
Abstract. This document presents and contrasts current efforts at applying
Formal Concept Analysis (FCA) to some semi-structured document
collections and file systems in general. Existing efforts are then
contrasted with ongoing efforts using the libferris Virtual File System
(VFS) as a base for FCA on file systems.



1 Introduction 

The file system has become the de facto standard for the storage and management
of semi-structured data on computers. File systems have evolved from presenting
a list of named objects (files) which contain a contiguous range of bytes into the
modern file system comprised of a tree structure augmented with soft and hard
links, sparse files, extended attributes and transparent support for many on-disk
storage formats.

File systems perform many roles including storage of a user’s data as well as
meta data and configuration settings. When most users consider meta data that
is stored by a file system they think of a file’s size, modification time, access
permissions etc. While such meta data has been in common use for a very long
time, modern file systems allow much more meta data to be stored and retrieved 1.
It has become commonplace for applications to store their configuration settings
in the user’s home directory under UNIX systems, extending the use of file
systems to containing meta data about application instances themselves.

This paper has two distinct purposes: to survey current literature on the
application of Formal Concept Analysis (FCA) to file systems and to present
libferris 2 and how it offers new possibilities for application of FCA on file systems.
It is assumed that the reader is familiar with the concepts of FCA. The common
notation of the object set G, attribute set M and incidence relation I ⊆ G × M
is used throughout.

The paper is organized as follows: a survey of preexisting efforts to apply
FCA to file systems is followed by an examination of the features of the virtual file
system libferris that can be used for FCA. A brief conclusion then completes the
paper.

1 EA and ACL for Linux website, http://acl.bestbits.at/ 

2 http://witme.sourceforge.net/libferris.web/

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 88-95, 2004. 

(c) Springer-Verlag Berlin Heidelberg 2004
2 Formal Concept Analysis and File Systems

Ferré and Ridoux present an alternate representation of FCA as Logical Concept
Analysis (LCA) in [7] where the context (G, M, I) has the attributes M replaced
by a (possibly infinite) lattice of formulas and I attaches formulas to the objects
in G.

Applying FCA to file systems can be seen from many perspectives in the 
literature. Much research has been done on applying FCA to text document 
collections [2,11]. FCA has also been applied to more structured data such as
email [4]. Application to file systems as a whole is covered in [6]. 

There are many limitations of a hierarchical file system model which are addressed
by using FCA. The most striking is that a tree structure forces logical
containment of files in a directory and the encoding of meta data about each file
into its path in the tree [4,6]. This can be eased by use of soft and hard links but
in so doing the semantics of file access become more complicated (dangling links,
cycles during link resolution). Encoding meta data into a file’s path hinders the
location of conceptually similar documents because a small change in one piece of
meta data may require one to scan from the root of the tree to find a document.
Consider the example where paths are created by first encoding the year the 
document was authored and then the general type of document such as audio, 
video or text. If one is browsing “/2003/text/whitepaper/libferris/fcasurvey.tex” 
and wishes to find other libferris whitepapers that were not necessarily authored 
in 2003 then they must try other branches from the root of the tree and scan 
down a similar path from each of those. 

To alleviate the single access path issue many file systems offer the ability 
to find conceptually similar documents by showing the results of a query as a 
file system [8,9]. Such views have the drawback of being read only or allowing 
inconsistency and usually being somewhat separate from the standard navigation 
paths. In moving to the lattice structure of FCA both querying and navigation 
are presented via the same interface and a user can seamlessly switch between 
both styles of interaction [6]. 

In the upgrade from a hierarchical structure to a concept lattice, directories
become concepts, symbolic links are no longer required and files form the object 
set G in the formal context. There is no requirement for symbolic links because 
FCA allows a file to exist in many concepts at the same time. Because a lattice 
structure allows a concept to have multiple parents it allows objects that are 
conceptually close to each other to be close in the lattice as well [1]. In the above 
example of looking for other libferris whitepapers one would only need to move 
up the lattice to loosen the restriction in the time dimension to see other related 
libferris papers. To gain access to informal documentation on libferris one could 
then navigate upward to remove the whitepaper attribute. 
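The whitepaper example can be sketched as follows. This is an illustration only: apart from the path quoted above, the file names are hypothetical, each path component is treated as one attribute, and a set of attributes is used as a query whose result is the extent of the corresponding concept.

```python
# Each file is described by a set of attributes (one per path component).
files = {
    "/2003/text/whitepaper/libferris/fcasurvey.tex":
        {"2003", "text", "whitepaper", "libferris"},
    "/2002/text/whitepaper/libferris/design.tex":       # hypothetical file
        {"2002", "text", "whitepaper", "libferris"},
    "/2003/text/notes/libferris/todo.txt":               # hypothetical file
        {"2003", "text", "libferris"},
}


def extent(intent):
    """Return the files that carry every attribute in `intent`."""
    return {path for path, attrs in files.items() if intent <= attrs}


start = {"2003", "text", "whitepaper", "libferris"}
any_year = start - {"2003"}            # move up: loosen the time restriction
informal = any_year - {"whitepaper"}   # move up again: drop "whitepaper"
```

Each removal of an attribute corresponds to a move upward in the lattice, and the set of visible files can only grow.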

Ferré and Ridoux [6] generalize the current working directory pwd(1) into
a history stack of working concepts. This is done to allow one to move to the
correct parent concept easily. The familiar command cd .. becomes a pop operation
on the history stack or a move to the root concept if the history is
empty. The semantics of cd .. become more of a navigation backwards than
a navigation to the last direct superconcept as detailed in [6]. The change in
semantics is because the relative or absolute move in the lattice is not broken
into subparts and pushed individually but each refinement provides a single push
onto the stack. If one is doing FCA using this working concept stack then one
could break a navigation into its component attributes and push them as individual
refinements. For example, given the working concept “/2003/text” and
a command cd whitepaper/libferris the stack could have the two attributes
pushed onto it in the order presented on the command line.

Due to concepts being multi-parented there can be many paths from the root
concept to any concept in the lattice. The “parent concept” stack should however
still be maintained in the order given by the user so that only the last attribute
in the path is removed by a cd .. command, i.e. the particular absolute path
chosen to find a concept is only relevant to future relative path operations. If
one can easily strip off the last attribute of a path then one only needs to store 
the path used to reach the current concept to allow relative path operations. 
Presumably such a technique was not chosen for [6] due to the use of formulas 
for attributes in Logical Concept Analysis. 
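The per-attribute variant of the working concept stack suggested above can be sketched as follows. The class and method names are hypothetical, and this is neither the Conceptual Shell of [6] nor part of libferris; it only shows that pushing each attribute separately lets cd .. strip exactly the last attribute.

```python
class ConceptShell:
    """Keeps the attributes of the working concept in the order the user gave them."""

    def __init__(self):
        self.stack = []

    def cd(self, path):
        if path == "..":
            # Pop the last attribute; at the root concept this is a no-op.
            if self.stack:
                self.stack.pop()
            return
        if path.startswith("/"):
            # An absolute path restarts navigation from the root concept.
            self.stack = []
        # Push every attribute of the path as an individual refinement.
        self.stack.extend(part for part in path.split("/") if part)

    def pwd(self):
        return "/" + "/".join(self.stack)


shell = ConceptShell()
shell.cd("/2003/text")
shell.cd("whitepaper/libferris")
before = shell.pwd()
shell.cd("..")
after = shell.pwd()
```

With a single push per cd command, as in [6], the final cd .. would instead return to “/2003/text”.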

The ls command is made modal in [6] by separating out the query of the
extent of a concept (ls -r f) from the query for the refinements available from a
working concept (ls f). This seems somewhat artificial as the traditional UNIX
ls command makes no distinction between showing only places to navigate to
and showing only the files in the current directory.

Using the working concept one can easily navigate the concept lattice using the
familiar commands cd, ls and pwd, and modify the lattice using mv, cp and rm,
as has been done in the Conceptual Shell [6]. Although altering the current
working directory with cd should be easy enough, explicit detail is not given 
in [6] about how one resolves copy, move and remove operations on objects in 
the lattice. The most challenging operation would be the mv command. Consider 
the case of moving an image from a subconcept of “true colour” to a subconcept 
of “monochrome”. Such an operation would require a lossy transformation to 
occur on the actual image data in order to maintain the semantic consistency of 
the objects in each concept in the lattice. 

A commonly noted distinction is between intrinsic attributes of an object,
which may be mechanically determined from an object’s byte content, and extrinsic
properties, which require user interaction to determine. Extraction of intrinsic
properties from documents covers many domain-specific techniques, such
as the Ripple-Down Rule (RDR) knowledge acquisition and maintenance
methodology from knowledge based systems [11], email headers, regular
expression matches and machine learning algorithms [4,15], or an arbitrary
extraction function [6].

Extrinsic attributes for objects are discussed less than intrinsic ones. The Conceptual
Email Manager (CEM) [4] uses extrinsic attributes to allow the user to
override intrinsic attributes to always be true or false for a particular message
and also to allow CEM to update attributes such as “mail read” and “new
mail” [4].
The collection of intrinsic and extrinsic attributes is used to form the attribute
set M for FCA. In CEM [4] the attributes form a partial order (M, ≤)
such that the transitivity in the partially ordered attributes is also reflected in
the relation I of the formal context (G, M, I), i.e. for an object g ∈ G and an
attribute m ∈ M, if g I m then ∀μ ∈ transitive-parent(m), g I μ. This allows one
to not only tag files with the most specific attributes but to find them in the
formal context using less specific attributes relative to (M, ≤). The partial order
(M, ≤) in CEM is edited using a tree widget in which multi-parented attributes
appear under each of their parent attributes in the tree.
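The closure of the incidence relation under the attribute order can be sketched as below. The representation (a dictionary from each attribute to its direct parents) and the example attribute names are assumptions made for illustration; this is not the CEM implementation.

```python
def transitive_parents(attr, parents):
    """All attributes above `attr` in the partial order (excluding `attr`)."""
    seen = set()
    todo = list(parents.get(attr, ()))
    while todo:
        parent = todo.pop()
        if parent not in seen:
            seen.add(parent)
            todo.extend(parents.get(parent, ()))
    return seen


def close_incidence(incidence, parents):
    """Add (g, mu) for every (g, m) in the relation and every transitive parent mu of m."""
    closed = set(incidence)
    for g, m in incidence:
        closed |= {(g, mu) for mu in transitive_parents(m, parents)}
    return closed


# Hypothetical attribute order: icfca <= conferences <= work.
parents = {"icfca": {"conferences"}, "conferences": {"work"}, "work": set()}
incidence = {("msg1", "icfca")}
closed = close_incidence(incidence, parents)
```

A message tagged only with the most specific attribute is then also found under each less specific one.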

Although Ferré and Ridoux use formulas as their attributes they too apply
an ordering to their attributes [7,6]. Their formulas are ordered by a possibly 
infinite lattice and they present methods to enable navigation of the concept 
lattice built from these formulas using contextualized logic. 

Modern file systems support Extended Attributes (EAs) which allow arbitrary
key-value data to be attached to files and directories. Additional APIs have
been provided for both UNIX 3 and Microsoft Windows 4 operating systems for
associating a key-value pair with a file. With EAs an application can store meta
data about a file with the file itself on disk. One can abstractly consider the
EAs for a file f as a subdirectory which can not contain directories but only files
containing meta data about the file f. In this light, a directory can be seen as a
many valued formal context. Assume that a directory with content G forms the
objects g using its contents. A file g ∈ G may have an EA m ∈ M with value
w ∈ W where m and w are strings. Then w is functionally dependent on g and
m.

Enrichments to FCA have been created to allow many valued attributes to be
used and simple logics over attributes to generate summary attributes. Creating
a binary relation that can be used as I ⊆ G × M can be done by creating
conceptual scales [5,4,3,14] or using logical scaling [14,13]. Both conceptual and
logical scaling can be seen as a method to take one or more columns in a many
valued context and generate one or more new binary columns as the result.
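As an illustration of logical scaling, the sketch below turns a small many valued context into a binary relation. It assumes EA values are strings, as stated above; the file names, EA names and predicates are hypothetical, and the directory is modelled as a plain dictionary rather than read from disk.

```python
# Many valued context: file -> {EA name: string value}.
many_valued = {
    "a.png": {"width": "640", "mimetype": "image/png"},
    "b.png": {"width": "1280", "mimetype": "image/png"},
    "c.txt": {"mimetype": "text/plain"},
}

# Each binary attribute is defined by a predicate over a file's EAs.
scale = {
    "width<800": lambda ea: "width" in ea and int(ea["width"]) < 800,
    "is-image": lambda ea: ea.get("mimetype", "").startswith("image/"),
}


def apply_scale(context, scale):
    """Return the binary incidence relation produced by the scale."""
    return {
        (g, m)
        for g, ea in context.items()
        for m, predicate in scale.items()
        if predicate(ea)
    }


binary_incidence = apply_scale(many_valued, scale)
```

The resulting pairs can be used directly as the relation I of a formal context whose attributes are the predicate names.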

CEM [4] creates conceptual scales automatically based on the partial order it
maintains over the attributes in the formal context. A default scale is created
∀μ ∈ (M, ≤) which holds for an object g iff g has any of the direct child
attributes of μ in (M, ≤). Using LCA for file systems [6] has a similar setup:
using the lattice of formulas it follows that one formula μ deduces all formulas
below it in the lattice.

3 libFerris and FCA 

The libferris project was originally created to provide a modern semantic file 
system [8]. A semantic file system differs from a traditional file system in two 
major ways: by allowing the results of a query to be presented as a file system 
and to present interesting meta data from files as key-value attributes. Queries 

3 EA and ACL for Linux website, http://acl.bestbits.at/ 

4 http://linux-ntfs.sf.net/ntfs/attributes/ea_information.html
are submitted to the file system embedded in the path and the results of the
query form the content of the virtual directory. For example, to find all documents
that have been modified in the last week one might read the directory
“/query/ (mtime>begin last week)/”. Interesting meta data is extracted from a
file’s byte content using what are referred to as transducers in [8]. An example of
a transducer would be a little bit of code that can extract the width of a specific
image file format.

The core abstractions in libferris can be seen as the ability to present many 
trees of information overlaid into one namespace, the presentation of key-value 
attributes that files possess, a generic stream interface for file and attribute content,
indexing services and the creation of arbitrary new files.

Overlaid trees are presented because one can drill into composite files such 
as XML, ISAM 5 databases or tar files and view them as a file system. The 
overlaying of trees is synonymous with mounting a new file system over a mount 
point on a UNIX machine to join two trees into one globally navigable tree. 
Being able to overlay trees in this fashion allows libferris to provide a single file 
system model on top of a number of heterogeneous data sources 6 . 

Presentation of key-value attributes is performed by either storing attributes
in kernel level EAs or by creating synthetic attributes whose values can be dynamically
generated and can perform actions when their values are changed.
Both stored and generated attributes in libferris are referred to simply as EAs.
Examples of EAs that can be generated include the width and height of an image,
the bit rate of an mp3 file or the MD5 7 hash of a file. For an example of
a synthetic attribute that is writable consider an image file which has the key-value
EA width=800 attached to it. When one writes a value of 640 to the EA
width for this image, the file’s image data is scaled to be only 640 pixels
wide. Having performed the scaling of image data, the next time the width EA is
read for this image it will generate the value 640 because the image data is 640
pixels wide. In this way the differences between generated and stored attributes
are abstracted from applications.
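The behaviour of a writable synthetic attribute can be mimicked with a Python property. This is only an analogy and not the libferris API: the class is hypothetical, the image data is reduced to a pair of dimensions, and preserving the aspect ratio when scaling is an assumption that the text does not state.

```python
class Image:
    """Toy stand-in for an image file with a synthetic, writable width attribute."""

    def __init__(self, width, height):
        self._pixels = (width, height)  # stands in for the real image data

    @property
    def width(self):
        # Generated on every read from the image data itself.
        return self._pixels[0]

    @width.setter
    def width(self, new_width):
        # Writing the attribute performs an action: the image data is rescaled.
        old_width, old_height = self._pixels
        new_height = round(old_height * new_width / old_width)
        self._pixels = (new_width, new_height)

    @property
    def height(self):
        return self._pixels[1]


image = Image(800, 600)
image.width = 640
```

A caller reads and writes width like any stored value and never sees whether it is stored or generated.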

Another way libferris extends the EA interface is by offering schema data 
for attributes. Such meta data allows for default sorting orders to be set for 
a datatype, filtering to use the correct comparison operator (integer vs. string 
comparison), and GUIs to present data in a format which the user will find intuitive.
Having the correct comparison operator and sorting order is a prerequisite
to generating logical scales for an EA.

Although attaching and interacting with typed arbitrary key-value data on 
files is very convenient in libferris it leaves applications open to interpret the data 
how they choose. For this reason specific EAs have been defined for semantic 
categorization of files on a system wide basis. These EAs allow one to associate 



5 Indexed Sequential Access Method, eg. B-Tree data stores such as Berkeley db. 

6 Some of the data sources that libferris currently handles include: http, ftp, db4, dvd,
edb, eet, ssh, tar, gdbm, sysV shared memory, LDAP, mbox, sockets, mysql, tdb and
XML.

7 MD5 hash function RFC, http://www.ietf.org/rfc/rfc1321.txt
files with “emblems” to explicitly capture the intrinsic and extrinsic attributes
of a file. The set of all emblems ξ is maintained in a partial order (ξ, ≤). The
relation μ ≤ φ for μ, φ ∈ ξ means that μ logically is-a φ. Consider the
example of marking an image file: one may attach the emblem “sam” to the
image file. This emblem may have a parent “my friends” to indicate that sam is
one of my friends. It follows that the image file is also of one of my friends. A
file can have many emblems attached to it. For best querying results the most
specific emblems from (ξ, ≤) that are applicable to a file should be attached.

In a way the partial order of emblems (ξ, ≤) maintains a richer structure
than the simple directory inclusion used in standard hierarchical file systems. To
define direct directory inclusion as a relation, consider a
set of documents G arranged into a standard file system tree using the relation
λ ⊆ G × G. The relation λ is considered anti-transitive, i.e. xλy, yλz ⇒ ¬(xλz).
The relation λ is also not reflexive and is antisymmetric. One can use λ to
represent the normal parent relationship from file systems so: aλb ⇔ a is a direct
parent of b. In normal file systems each object would have only one parent.

The emblems (ξ, ≤) can be mounted as a file system in libferris and allow
retrieval of all objects that have an emblem when the leaf emblems in (ξ, ≤)
are listed. Thus executing ls -l emblems://whitepaper/ferris would show
a list of all files that have been tagged with ferris or any of its transitive parent
emblems in (ξ, ≤).

In order to quickly find the set of files that have a given attribute value,
indexing is available on both the full text of files [17] and on their EAs. The
EA index is maintained in a sorted order to allow the list of files for which
a comparative constraint on their EA holds to be found directly. For example,
width<800 can be resolved completely using only the index.
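Why a sorted index suffices for such constraints can be seen in the following sketch, which uses Python's standard bisect module. It is a simplified, in-memory illustration under stated assumptions (one numeric value per file, hypothetical file names) and not the libferris index format.

```python
import bisect


class SortedEAIndex:
    """Index for a single EA, kept sorted by attribute value."""

    def __init__(self):
        self._values = []  # sorted attribute values
        self._files = []   # file names, parallel to _values

    def add(self, value, filename):
        position = bisect.bisect_right(self._values, value)
        self._values.insert(position, value)
        self._files.insert(position, filename)

    def less_than(self, bound):
        """Files whose value is strictly below `bound`, found by binary search."""
        return self._files[:bisect.bisect_left(self._values, bound)]


width_index = SortedEAIndex()
width_index.add(1024, "big.png")
width_index.add(640, "small.png")
width_index.add(800, "mid.png")
narrow = width_index.less_than(800)
```

The query touches only the index: no file has to be opened to decide whether width<800 holds.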

Given the indexes, files, emblems and attributes from libferris there are many 
ways to create a formal context: logical scaling, file system inclusion and the 
presence of emblems. 

The most mechanical of these is using emblems because a file either has one 
attached or it doesn’t. Logical scaling can be used by supplying a simple boolean 
logic to use on the attribute index, for example, width < 900. There are several 
predicate languages currently in use: an extended Lightweight Directory Access
Protocol (LDAP) search syntax 8 and XPath expressions 9 . 

File system inclusion is a form of logical scaling which, when given a file
system Z, will create a new attribute ζ in the context. For each g ∈ G the value
of g I ζ will be true iff ∃z ∈ Z such that g = z. Using file system inclusion and
the full text index one can define an attribute in the formal context which is
true iff objects g ∈ G satisfy a boolean full text query. For example, if we wish
to have a new attribute ζ indicating if an object satisfies the boolean query
alice wonderland we first mount the full text query and obtain the file system
Z showing the files containing these terms and then bind that file system to the
formal context using file system inclusion:



8 http://www.faqs.org/rfcs/rfc2254.html

9 XML Path Language (XPath) Version 1.0, http://www.w3.org/TR/xpath 
$K = (G,\ M \cup \{\zeta\},\ I \cup \{(g, \zeta) \mid g \in G,\ \exists g' \in Z,\ g = g'\})$ (1)
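
As a rough illustration of file system inclusion, the sketch below extends a context held as Python sets with one new attribute. The representation of the context, the function name add_inclusion_attribute and the sample file names are assumptions for the example and are not part of libferris.

```python
def add_inclusion_attribute(objects, attributes, incidence, file_system, name):
    """Extend a formal context (G, M, I) with one attribute.

    The new attribute holds for exactly those objects of G that also
    appear in the given file system Z.
    """
    new_attributes = set(attributes) | {name}
    new_incidence = set(incidence) | {
        (g, name) for g in objects if g in file_system
    }
    return set(objects), new_attributes, new_incidence


# Hypothetical data: three documents and one existing attribute.
G = {"alice.txt", "notes.txt", "todo.txt"}
M = {"text"}
I = {(g, "text") for g in G}

# Hypothetical result of mounting the full text query "alice wonderland".
Z = {"alice.txt", "notes.txt"}

G2, M2, I2 = add_inclusion_attribute(G, M, I, Z, "zeta")
print(sorted(g for g, m in I2 if m == "zeta"))
```

The original incidence relation is left untouched; only pairs for the new attribute are added.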

One of the main utilities of this scheme is to allow existing virtual file systems 
that libferris can generate based on queries to be leveraged in FCA. 

In a way the use of emblems to categorize one's files combined with the use
of a properly synchronized index can be seen as a mechanism for cached queries.
In this light it is possible to allow logical scaling to be performed before it is
used in FCA. For example, by using a collection of suitably parented emblems
one can cache numeric intrinsic results. This is done using a chain of emblems
$a \leftarrow b \leftarrow c \leftarrow d$ where the description of $a$ is “< 10”, that of $b$ is “< 5” and so
on. A script can then be run over a file system tree to assign the
most specific emblem to each file for later use in FCA. This does require that
the level of quantization captured in the emblem chain is acceptably small when
the objects are tagged with emblems to allow future use in FCA.
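
The tagging script mentioned above might look roughly like this. Only the bounds for a and b come from the text; the bounds for c and d, the list layout and the function name are assumptions made for the sketch.

```python
# Hypothetical emblem chain, ordered from least to most specific.
# Each description is read as an upper bound "value < bound".
CHAIN = [("a", 10), ("b", 5), ("c", 2), ("d", 1)]


def most_specific_emblem(value, chain=CHAIN):
    """Return the last emblem in the chain whose bound the value satisfies.

    Returns None when the value does not even satisfy the first bound.
    """
    chosen = None
    for emblem, bound in chain:
        if value < bound:
            chosen = emblem
        else:
            break  # bounds only get tighter further down the chain
    return chosen


for value in (12, 7, 3, 0.5):
    print(value, most_specific_emblem(value))
```

Since each emblem is a parent of the next one in the chain, tagging a file with the most specific emblem implicitly records all the weaker bounds as well.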

There are three main possibilities as to who attaches emblems to files: explicit
assertion or retraction by the user, use of rigid rule sets (such as width >= 1024
implies attachment of the medium-resolution-image emblem) and use of
Supervised Machine Learning (SML) algorithms. Automatic attachment has been
explored using SML algorithms such as Support Vector Models [10] or Bayesian
networks 10. The main issue with using such SML algorithms is that they are not
entirely accurate. See [12] for further examination of emblem attachment and in
particular automated emblem attachment.
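
A rigid rule set of the kind mentioned above can be sketched as a list of emblem names paired with predicates over a file's EAs. The dictionary representation of EAs and the function name emblems_for are assumptions for the example; only the width >= 1024 rule is taken from the text.

```python
# Each rule pairs an emblem name with a predicate over a file's EAs.
RULES = [
    ("medium-resolution-image", lambda ea: ea.get("width", 0) >= 1024),
]


def emblems_for(ea, rules=RULES):
    """Return the set of emblems whose rule holds for the given EA values."""
    return {emblem for emblem, predicate in rules if predicate(ea)}


print(emblems_for({"width": 1280}))
print(emblems_for({"width": 800}))
```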

4 Conclusion 

A survey of the limited work that exists applying Formal Concept Analysis to
file systems and semi-structured data has been presented. A modern semantic
file system, libferris, which can be used to generate formal contexts suitable for
FCA has been presented. Discussion of libferris covered the use of EAs to create
many-valued formal contexts and the use of emblems as a direct source of binary
attributes. Some concerns such as the requirement of indexing were also touched
on.

By using a modern semantic file system as the data source for FCA one
extends the horizons of what can be analyzed to include not only structured
data from relational databases [16], semi-structured email data [4] or full text
documents [2] but data from many sources including directly from the Internet.
Abstract. TOSCANA-systems have been used in many contexts to
visualize information found in relational databases. As opposed to many
other approaches this is done based on conceptual structures rather than
numerical analysis. While conceptual structures often tend to ease
understanding of the data by avoiding too much information reduction,
sometimes the reduced information gained by using numerical
approaches is preferable. This paper discusses how conceptual structures
and numerical analysis can be combined into conceptual information
systems using ToscanaJ.



1 Introduction 

The development of the theory of conceptual information systems always went
along with the usage and development of tools, most noticeably TOSCANA. As
the first program of its kind, TOSCANA has a long history of projects where it
was applied, but it also encountered limitations due to aging effects of its code
base. About 2 years ago the KVO group 1 started the ToscanaJ project 2 with
the aim of addressing these issues with a new architecture. This work is based
on the experiences from TOSCANA projects, mostly coming from the Technical
University in Darmstadt, and it aims at giving both a professional tool for applied
conceptual knowledge processing and a platform to extend the theory of
conceptual information systems.

One area where ToscanaJ offers a range of new features extending the 
existing notion of conceptual information systems is the integration of numerical 
methods. In combination with a relational database management system a range 
of new combinations of conceptual and numerical analysis and visualization can 
be achieved, such as: 

- labels in line diagrams can display numerical aggregates of object sets;

- classical textual database reporting views can be displayed within the
program, thus easing access to larger amounts of numerical analysis data;

- charts can be accessed for any object set in a displayed lattice;

- ratios of object sets can be displayed in comparison to neighbouring sets
or the full object set within a diagram.

1 http://www.kvocentral.org

2 http://toscanaj.sourceforge.net

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 96-103, 2004.

(c) Springer-Verlag Berlin Heidelberg 2004

Some of these features have been described in earlier work, most noticeably
by Stumme and Wolff (see e.g. [SW97], [SW98]). This paper tries to give an
overview of the features in ToscanaJ relating to numerical analysis and to
contextualise these in a larger framework based on the ideas of Stumme and Wolff.
The implementations in ToscanaJ are used as examples and proof of concept
- unless otherwise stated all the described functionality has been implemented
and used in conceptual information systems using ToscanaJ.

The paper is structured as follows: Section 2 gives some background in the form
of an overview of the literature. Section 3 discusses the use of aggregates as
label contents in a line diagram. Section 4 shows how reporting tools similar to 
standard database analysis tools can be used within a conceptual information 
system. Section 5 discusses how to attach comparisons of extent sizes as labels 
onto the lines. The paper concludes with an outlook for further developments in 
Section 6. 

2 Background 

We assume the reader is familiar with Formal Concept Analysis and the
terminology and notations introduced by Ganter and Wille in [GW99]. We also
assume that the reader knows about the notion of conceptual information
systems as defined for example in [HSWW00]. An overview of ToscanaJ and its
features can be found in [BH04].

While the notion of a conceptual information system has been well elaborated
in the literature ([HSWW00], [HS01], [Stu00]), the combination with numerical
methods is less well researched with the exception of the combination of FCA
and data mining (e.g. [STB+02]).

Stumme and Wolff have presented two distinct ideas in some of their papers
([SW97], [SW98]): the use of aggregating functions to extend conceptual
information systems to integrate so-called relational structures as an extension to the
idea of many-valued contexts, and the use of Pearson’s $\chi^2$-measure to determine
the dependency of two scales.

Data mining and FCA are supported by a range of tools like the Titanic
algorithm ([STB+02]) and different programs. A tool called Chianti was written
by Stumme and Hereth ([HSWW00]) to investigate different measures for the
distance of scales 3. Other aspects like the combination of aggregating functions
with conceptual systems have been described in research papers, but to the
knowledge of the author have never been implemented.

This paper will discuss some of these combinations and how they are
implemented in the ToscanaJ program. Besides giving a proof of concept with the
implementation, some additional notions like the integration of reporting tools
will be discussed to extend the notion of a conceptual information system into
a broader data analysis technique.

3 Note that we use “distance” in a wider sense than metric.

3 Displaying Aggregates within the Line Diagrams 

TOSCANA 2 and 3 have two basic options for aggregation - instead of just
allowing the user to see the lists of the items in the extents or object contingents,
they also allow one to show the object counts either as absolute numbers or
as a distribution, that is, relative to the size of the current object set. For many
conceptual information systems created this was sufficient, but for other systems
it turned out to be a limitation since other aggregates were considered useful in
their context but not easily available.

While using the same default options for the labels, ToscanaJ tries to give
the Conceptual System Engineer more flexibility by allowing the definition of
additional label contents in the CSX file, which contains the schema for the
conceptual system in XML format. Thus the user of the system will be offered
more options for the labels if the corresponding schema is loaded. These options
can be either lists or aggregates. Since we are interested in the numerical aspects,
only the latter will be discussed here.

If labels should display aggregates, the CSX file has to contain a definition
of an <aggregateQuery>, which contains at least one SQL fragment defining
the aggregate to use. All aggregates are defined on the SQL level, which gives
a great level of flexibility and good performance since the aggregation happens
in the database engine. Most database engines also allow writing extensions, so
new aggregates can be introduced in this way. By not using a separate aggregation
system, both performance and a clean abstraction are achieved.
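
To illustrate aggregation inside the database engine, the sketch below evaluates SQL aggregate fragments over a selected object set using Python's sqlite3 module, and registers one user-defined aggregate. It does not reproduce ToscanaJ's code: the table pcs, its rows, the helper aggregate_label and the SPREAD aggregate are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pcs (name TEXT, casetype TEXT, price REAL)")
conn.executemany(
    "INSERT INTO pcs VALUES (?, ?, ?)",
    [("pc1", "Tower", 2665.0), ("pc2", "Tower", 4529.0), ("pc3", "Desktop", 2260.0)],
)


class PriceSpread:
    """Hypothetical user-defined aggregate: largest minus smallest value."""

    def __init__(self):
        self.low = None
        self.high = None

    def step(self, value):
        if value is None:
            return  # ignore NULLs, as the built-in aggregates do
        self.low = value if self.low is None else min(self.low, value)
        self.high = value if self.high is None else max(self.high, value)

    def finalize(self):
        return None if self.low is None else self.high - self.low


conn.create_aggregate("SPREAD", 1, PriceSpread)


def aggregate_label(conn, aggregate_sql, where_sql):
    """Evaluate an SQL aggregate fragment over the objects selected by where_sql.

    Both fragments are assumed to come from a trusted schema file.
    """
    row = conn.execute(f"SELECT {aggregate_sql} FROM pcs WHERE {where_sql}").fetchone()
    return row[0]


low = aggregate_label(conn, "MIN(price)", "casetype = 'Tower'")
high = aggregate_label(conn, "MAX(price)", "casetype = 'Tower'")
label = f"${low:.2f} - ${high:.2f}"
print(label)
print(aggregate_label(conn, "SPREAD(price)", "casetype = 'Tower'"))
```

The formatting and the combination of two aggregates into one label happen outside the database, while the aggregation itself is left to the engine.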

Not only can the Conceptual System Engineer use any aggregate defined in
the underlying database management system, ToscanaJ also offers two
orthogonal extensions:

— aggregates can be formatted (e.g. as currencies) and multiple aggregates can 
be combined into a single label; and 

— the results can be viewed as relative numbers compared to the same aggre- 
gate on all objects in the current diagram. 

These extra options can be illustrated using two examples. The first one 
gives the price range of the objects in question by querying the minimum and 
maximum of the PRICE column and displaying it in a formatted way as shown in 
Fig. 1 on the left. A result of this query can be seen to the right. The definition 
is given only structurally to avoid the overhead of the XML syntax. In the CSX 
file the indentation will be given as nested XML elements with attributes. The 
name is used for menu entries. 

aggregateQuery
    name = "Price Range"
    queryField
        format = "$ 0.00"
        query = MIN(price)
        separator = " - "
    queryField
        format = "$ 0.00"
        query = MAX(price)

[Line diagram: concepts labelled Tower, Desktop, Mini-Tower and Slimline; the
object labels show the price ranges "$2665.00 - $4529.00", "$2260.00 - $5415.00",
"$2895.00 - $4495.00", "$2721.00 - $4665.00" and "$2675.00 - $4205.00".]

Fig. 1. Querying a range from a database column

Displaying the aggregate relative to the same aggregate on the current object
set is a generalization of the distribution labels from TOSCANA 2/3. There the
number of objects in a certain set (extent/contingent) is displayed relative to the
full number of objects. A similar approach can be taken with other aggregates,
although of course the interpretation changes. For example the query shown
in Fig. 2 queries the average price once as a normal query, and then again as a
relative value, compared to the average on the current object set. The value of
98.93% on the right in the resulting diagram means that the average price of the
objects in the contingent is 1.07% below the average of the objects displayed in
the diagram (either the full object set or the set reduced by filtering).
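
The arithmetic behind a relative average can be sketched as follows. The price lists are invented for illustration and the function name relative_average is an assumption; the computation only mirrors the description above, not ToscanaJ's implementation.

```python
def relative_average(subset_prices, diagram_prices):
    """Average of a subset as a percentage of the average over the diagram's objects."""
    subset_avg = sum(subset_prices) / len(subset_prices)
    diagram_avg = sum(diagram_prices) / len(diagram_prices)
    return 100.0 * subset_avg / diagram_avg


# Hypothetical prices: all objects in the diagram and one object contingent.
diagram = [3000.0, 3500.0, 4000.0, 4500.0]
contingent = [3000.0, 3500.0]

print(f"{relative_average(contingent, diagram):.2f} %")
```

A result below 100 % means the subset's average lies below the average of the displayed object set, by the difference to 100 %.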

These features (using SQL aggregates, being able to combine them, and the
relative views) allow a conceptual information system to be customized to display
exactly the information that is useful in its domain. The Conceptual System
Engineer can mix numerical analysis into the conceptual information system in a
suitable fashion to enhance the ease of use and usefulness of the system created.



aggregateQuery
    name = "Average Price (relative)"
    queryField
        query = AVG(price)
        format = "$ 0.00"
        separator = " ("
    queryField
        query = AVG(price)
        relative = true
        format = "0.00 %"
        separator = ")"

[Line diagram: concepts labelled Tower, Desktop, Mini-Tower, Small-footprint and
Slimline; the object labels show "$ 3516.00 (99.18 %)", "$ 3527.00 (99.49 %)",
"$ 3496.00 (98.62 %)", "$ 3658.00 (103.19 %)" and "$ 3507.00 (98.93 %)".]

Fig. 2. Querying an average value as absolute and relative numbers



4 Using Conceptual Structures to Access Numerical Reports

If more than just simple aggregates need to be shown to the user, displaying
this information directly in the line diagram is no longer feasible. In this case
additional windows are needed that can be shown on request. ToscanaJ follows
the tradition of the older TOSCANA systems by attaching this operation to the
object labels - since the information is about the object sets, this is a well
accepted approach.

Fig. 3. Calling a bar chart to display numerical information about an extent

There are two distinct kinds of popup windows the user can open. The first are
views on single items, which provide information about a particular object in
a set. To access this object-specific information, a particular object has to be
selected from a list. In contrast, reports showing information about a full
object set can be selected on any type of label.

Obviously the reports allow integration with aggregates and other numerical
facilities. TOSCANA 2 and 3 used MS Access forms and HTML templates to
do the reporting, while ToscanaJ offers a plugin interface with a number of
implementations. One of them uses a similar approach to TOSCANA 3 and uses
HTML templates to create reports. As opposed to TOSCANA 3, ToscanaJ can
show these as windows within the main application - TOSCANA 3 had to start
an external browser. The internal approach allows better use of the screen real
estate.

ToscanaJ exceeds the functionality of the older programs in its ability to 
access charts for the object sets in a lattice. Due to the integration with the
JFreeChart rendering engine 4, a range of different charts can be displayed within
the framework of the scales. This includes line charts, bar charts, pie charts, 
area charts, scatter plots and many more. The charts are highly configurable 
and additional information like moving averages can be added into the charts. 

Fig. 3 is a screenshot of ToscanaJ showing a bar chart of benchmark data
for a particular extent. Via the context menu of the leftmost object label the
chart option configured in the CSX file has been selected, which opens a chart
created by JFreeChart. This chart shows the results for three different types of
benchmark in a quick overview. For this type of information a chart is far more
efficient than a conceptual scale and the integration allows integrating typical
views from existing data analysis approaches into a conceptual information
system.

4 http://jfree.org/jfreechart

This combination of conceptual and numerical data analysis allows the user to
combine the two approaches to synergetic effect. While the conceptual structure
gives the guidance to identify the objects of interest, the numerical summary
information in the reports - either in textual or in chart form - allows quickly
identifying certain characteristics of the selected object set. These may then in
turn lead to more refinement within the conceptual information system.



5 Looking at Object Frequencies 

So far we have looked at numerical information we display based on single object 
sets - with the exceptions of the relative aggregates where the values found get 
normalized. Sometimes it is interesting to look at the aggregates for different 
sets and how they compare to each other. In our work on ToscanaJ we have 
so far investigated only the comparison of the object counts with each other, 
since this is the simplest case. Some of this work should be extended to other 
aggregates. 

One of our ideas for visualizing the ratios of object counts is to look at the
extent ratios of neighbouring concepts. This measure gives a notion close to
the notion of “support” in data mining. The extent ratio between two concepts
$C_1 > C_2$ tells what proportion of the objects which have all attributes in the intent of $C_1$
also have the additional attributes in the intent of $C_2$.

These ratios can be visualized in ToscanaJ as shown in Fig. 4. This example 
shows how easy it is to read information like “about 80% of the PCs are available 
via direct sales” or “about two thirds of the PCs available in shops are also 
available via direct sales”. Since this type of information is about a ratio of 
extents the information is easier to access on the edges. 
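
The extent ratio for an edge between an upper and a lower concept can be computed as sketched below. The function name, the set representation of extents and the sample objects are assumptions for the example; the five invented PCs are chosen so that the ratio matches the "about 80%" reading mentioned above.

```python
def extent_ratio(upper_extent, lower_extent):
    """Share of the upper concept's objects that also belong to the lower concept."""
    upper, lower = set(upper_extent), set(lower_extent)
    if not upper:
        raise ValueError("upper extent must not be empty")
    if not lower <= upper:
        raise ValueError("lower extent must be a subset of the upper extent")
    return len(lower) / len(upper)


# Hypothetical extents: all PCs, and those available via direct sales.
all_pcs = {"pc1", "pc2", "pc3", "pc4", "pc5"}
direct_sales = {"pc1", "pc2", "pc3", "pc4"}

print(f"{100 * extent_ratio(all_pcs, direct_sales):.0f} %")
```

The subset check reflects the fact that in a concept lattice the extent of a subconcept is always contained in the extent of its superconcept.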

6 Outlook and Conclusion 

We have shown a number of different ways to combine the structuring power 
of conceptual information systems with the information reduction of numerical 
approaches. Since the ToscanaJ features presented in the paper are mostly set 
up for specific conceptual information systems, the numerical aspects can be 
applied in a well-targeted fashion, based not only on the informational structure 
found in the data source, but also on the abilities of the end-user. 

ToscanaJ is an ongoing project and it is likely that more features in the
direction described here will be added in the near future. Another planned step
to allow more flexibility and power in configuring tailored information systems
is the unification of the different views used for displaying different types of
information.

Fig. 4. Labels on the edges denote extent ratios

Except for the labels on the lines all views visualize certain aspects of the
concepts. This includes not only the object and attribute labels and the
separate pop-up windows as used by the reporting tools, it also includes the nodes
themselves since in ToscanaJ they do not just denote the concepts’ positions
as in a classical line diagram, they also convey information about the concepts.
In the standard setting this is a color gradient hinting at the extent size.

All these views - labels, pop-ups and nodes - can be seen in some sense as
interchangeable. Most interesting seems to be the idea of replacing the standard
nodes with more informative structures like charts or UML-like boxes showing
attribute and object contingents. Furthermore ToscanaJ could allow an
arbitrary number of labels per concept and the notion of pop-up windows could
be unified with the labels by maintaining a connector as long as the pop-up is
visible. This would greatly increase readability when multiple pop-up windows
are open.

Such a refactoring of ToscanaJ would give great flexibility to advanced
Conceptual System Engineers. While we still propose to keep the default
configurations rather simple for ease of use, the option to tailor the type and amount
of information displayed in a particular ToscanaJ system seems to be a very
worthwhile goal to pursue - both for research purposes and for discovering
new ways of applying conceptual information systems in real world projects.
Abstract. Over the last two decades a number of tools have been developed to
support the application of Formal Concept Analysis (FCA) to a wide variety of 
domains. This paper presents an overview of tool support for FCA. 



1 Introduction 

Over the last two decades a number of tools have been developed to support the
application of Formal Concept Analysis (FCA) to a wide variety of domains. These tools range
from the early DOS-based implementations of Duquenne’s GLAD tool to Java-based
tools currently under active development like ConExp and ToscanaJ. Both commercial
and open-source software appears in the list which also includes general-purpose and
application specific tools.

The next section of the paper introduces the general purpose tools. Application 
specific tools are then discussed in Section 3 before Section 4 concludes the paper. 



2 FCA Tools 

Duquenne’s tool for General Lattice Analysis and Design (GLAD) is possibly the earliest 
software tool that facilitates the analysis of formal concept lattices [1]. GLAD is a DOS- 
based program written in FORTRAN that has been under development since 1983. The 
tool facilitates the editing, drawing, modifying, decomposing and approximation of finite 
lattices in general and is not restricted to the analysis of concept lattices. The lattices to be 
analysed can be derived from abstract mathematics or applied statistics using techniques 
like Analysis of Variance. Single-valued data can also be analysed by exploiting the 
classic correspondence between lattices and binary relations identified by Birkhoff. 

GLAD contains a large number of features, many of which are undocumented, and it
also supports “scenarios” which represent a form of macro. These scenarios can be used
to regenerate and manipulate a lattice by recalling the list of commands used to construct
it. Diagrams can also be output directly from GLAD in the Hewlett Packard Graphics
Language (HPGL) 1, a vector-based language designed for plotters.

ConImp (Contexts and Implications) is another DOS-based tool implemented by
Peter Burmeister [2] who started development in 1986 on an Apple II computer. While
ConImp is purely text based and provides no graphical output for lattices, it supports
a wide range of features for manipulating contexts and provides concept listings which
can be used for drawing line diagrams by hand.

1 See http://www.piclist.com/techref/language/hpgl.htm

P. Eklund (Ed.): ICFCA 2004, LNAI 2961, pp. 104-111, 2004.

The Duquenne-Guigues-base represents a canonical base of valid implications for a
given context and this is computed and used extensively within ConImp. Interactive
attribute exploration is supported which can be used to derive both the Duquenne-Guigues-
base and a typical set of objects. In addition, a three-valued logic that allows for true,
false and unknown values can also be used.
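
As a small illustration of what it means for an implication to be valid in a context, the sketch below tests an implication against a context stored as a dictionary and lists counterexamples. It does not compute the Duquenne-Guigues-base and is not taken from ConImp; the data layout, function names and sample context are assumptions for the example.

```python
def holds(context, premise, conclusion):
    """True if every object with all premise attributes also has all conclusion attributes."""
    premise, conclusion = set(premise), set(conclusion)
    return all(
        conclusion <= attrs for attrs in context.values() if premise <= attrs
    )


def counterexamples(context, premise, conclusion):
    """Objects that satisfy the premise but lack part of the conclusion."""
    premise, conclusion = set(premise), set(conclusion)
    return sorted(
        g for g, attrs in context.items()
        if premise <= attrs and not conclusion <= attrs
    )


# Hypothetical single-valued context: object -> set of attributes.
context = {
    "g1": {"a", "b"},
    "g2": {"a", "b", "c"},
    "g3": {"c"},
}

print(holds(context, {"a"}, {"b"}))
print(counterexamples(context, {"c"}, {"a"}))
```

During attribute exploration an expert is asked whether a proposed implication holds in general; a rejected implication is answered with a counterexample object of the kind the second function finds within the known context.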

While ConImp supports single-valued contexts, another tool called MBA (possibly
from the German for “Many-valued FCA”: “Mehrwertige BegriffsAnalyse”) can be used
to scale and pre-process many-valued contexts 2. In addition, contexts can be exported
from ConImp in the so called “Burmeister Format” (‘.CXT’) and rendered using another
DOS-based tool called Diagram [3]. The use of separate tools for the tasks of data
preparation, context creation, and line diagram rendering (Diagram) is also reflected in
the classic FCA tools Anaconda and TOSCANA.

Anaconda and TOSCANA (TOolS of Concept ANAlysis) are tools used for building 
conceptual knowledge systems on top of data stored in relational databases. A conceptual 
system engineer uses knowledge from a domain expert to create queries in the form of 
conceptual scales using a conceptual system editor. These scales essentially capture the 
expert’s knowledge and the information is stored in a conceptual system file. A user can 
then exploit the conceptual scales to retrieve or analyse data from the database using 
a conceptual schema browser [4]. In traditional TOSCANA systems Anaconda is the
conceptual system editor, TOSCANA is the conceptual system browser, and the data is 
stored in a Microsoft Access database. 

Anaconda is a tool for the creation and editing of contexts, line-diagrams and scales. 
The context, scales and line-diagrams are saved in a conceptual schema file which is 
then used by TOSCANA to analyse the data in the database. While TOSCANA users 
cannot create new scales, the scales can be composed to produce nested line diagrams. 
There are three versions of TOSCANA based on Vogt’s C++ FCA libraries [5,6] and 
more recently a Java-based version — ToscanaJ. 

ToscanaJ [4] is a platform-independent implementation of TOSCANA that supports
nested line diagrams, zooming and ideal/filter highlighting. Originally part of the Tockit
project 3 (an open source effort to produce a framework for conceptual knowledge
processing in Java), ToscanaJ is now a separate project 4.

In the context of the workflow described earlier, ToscanaJ represents the conceptual 
schema browser and the conceptual system editor role is filled by two tools — Siena 
and Elba. The two tools can be seen as Anaconda replacements that are both used for 
preparing contexts and scales, however, each represents a different workflow. Elba is 
used for building ToscanaJ systems on top of relational databases while Siena allows 
contexts to be defined using a simple point and click interface. 

ToscanaJ can be used to analyse data in relational databases via ODBC (Open
Database Connectivity)/JDBC (Java Database Connectivity) or, alternatively, an
embedded relational database within ToscanaJ can be used. Line diagrams can also be
exported in a variety of raster and vector-based formats including Portable Network
Graphics (PNG), Joint Photographic Expert Group (JPEG), Encapsulated PostScript
(EPS), Portable Document Format (PDF), and Scalable Vector Graphics (SVG).

2 See http://www.mathematik.tu-darmstadt.de/ags/agl/Software/

3 See http://tockit.sourceforge.net/

4 See http://toscanaj.sourceforge.net/

An XML-based conceptual schema file (.CSX) is used to store the context and scales 
produced by Siena and Elba. In addition, an extensible viewer interface allows custom 
views to be defined as well as allowing external data viewers to be specified. This feature 
is exploited by the formal specification browser SpecTrE described in Section 3.2. 

The line diagrams in Siena and Elba use an n-dimensional layout algorithm in which 
each attribute in the purified context is assigned to a vector [7]. The layout is then 
projected onto the Cartesian plane using standard parallel projection and the approach 
is based on the algorithm used in Cernato. 
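
One way to read the layout idea above is as an additive placement: every attribute owns one axis of an n-dimensional space, and a parallel projection maps each axis to a fixed 2D vector. The sketch below follows that reading only. It is an assumption made for illustration and is not the algorithm from [7] or the one used in Cernato; the function name and the projection vectors are invented.

```python
def concept_position(intent, attribute_order, projection):
    """Place a concept by summing the projected vectors of the attributes in its intent.

    projection[i] is the assumed 2D image of the axis that belongs to
    attribute_order[i]; y grows downwards, as in a line diagram.
    """
    x = y = 0.0
    for i, attribute in enumerate(attribute_order):
        if attribute in intent:
            dx, dy = projection[i]
            x += dx
            y += dy
    return x, y


# Hypothetical context with two attributes and a simple projection.
attributes = ["a", "b"]
projection = [(-1.0, 1.0), (1.0, 1.0)]

for intent in (set(), {"a"}, {"b"}, {"a", "b"}):
    print(sorted(intent), concept_position(intent, attributes, projection))
```

With this placement, changing one projection vector moves all concepts that contain the corresponding attribute by the same offset.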

Cernato is a commercial FCA tool developed by Navicon 5 that combines some of 
the features of Anaconda and TOSCANA into a single tool. Users are presented with a 
familiar spreadsheet-like interface for creating contexts and data can be imported and 
exported in Comma Separated Value (CSV) format which facilitates the analysis of data 
from genuine spreadsheet applications. 

Line diagrams are constructed incrementally in Cernato and the layout is animated 
by default. Zooming and the construction of scales, which are known as “views” in 
Cernato, are also supported, however, nested line-diagrams are not. In addition to the 
CSV import/export facility a custom XML format can also be used. Furthermore, line 
diagrams can be exported in a number of raster-based image formats, contexts can be 
saved as HTML-tables and Cernato is also able to export complete TOSCANA systems. 

ConExp (Concept Explorer) 6 is another Java-based, open-source FCA project. Like 
Cernato, ConExp combines context creation and visualisation into a single tool. While 
ConExp does not support database connectivity, contexts can be imported and exported in
ConImp’s ‘.CXT’ format. A number of lattice layout algorithms can be selected including
chain decomposition and spring-force algorithms. The line diagrams also support various 
forms of highlighting including ideal, filter, neighbour and single concept highlighting 
and can be exported in JPEG or GIF format. 

ConExp currently implements the largest set of operations from Ganter and Wille’s
FCA book [8] including calculation of association rules and the Duquenne-Guigues-base
of implications. Interactive attribute exploration is also supported and the context can
display the arrow relations $g \nearrow m$ and $g \swarrow m$.

GaLicia 7, the Galois Lattice Interactive Constructor, is another Java-based FCA tool
that provides both context creation and visualisation facilities [9]. GaLicia’s heritage
lies in a series of incremental data mining algorithms originally entitled the GAlois
Lattice-based Incremental Closed Itemset Approach and also a trie data-structure
based version called Galicia-T. These incremental algorithms were used for mining
association rules in transaction databases [10,11] and form the basis for the incremental
construction of lattices in GaLicia.

Both single and many-valued contexts can be analysed in GaLicia. In addition, 
binary relationships between objects can also be described via a context and stored 



5 See http://www.navicon.de 

6 See http://sourceforge.net/projects/conexp 

7 See http://www.iro.umontreal.ca/~valtchev/galicia/




